Libraries used to create and generate this report:
R version 4.3.1 (2023-06-16)2.211.421.0.40.341.3.4Libraries used to analyse data:
2.3.11.0.121.4.2Libraries used to load data:
1.16.11.14.86.24.0Cytoscape is used for visualization. Figures were generated using the Cytoscape v3.9.1 and several Cytoscape apps:
1.1.31.1.6Similarity Network Fusion (SNF) builds networks of samples for each data type. Then, it fuses them into one network, which represents the full spectrum of underlying data.
In other words, SNF integrates several types of data (e.g. omics data) into one network which represents the relationships between samples.
The SNF methods can be decomposed into three main steps, displayed in the Figure 2.1 (note that patient=samples in the description below):
Figure 2.1: Similarity Network Fusion method overview. The figure is comming from Wang et al., 2014.
In the final fused similarity network (e), you can identify which data type contributes to which edge:
You can retreive more information in:
Choose the dataset on which you want to apply SNF!!
Different datasets are available. Note that each dataset has its specificity and some analysis steps should be adapted.
Figure 3.1: Four datasets are available: Metagenomic dataset from Tara Ocean (image from Sunagawa et al., 2015), Breast cancer dataset from TCGA (image from TCGA website), CLL dataset (Dietrich et al., 2018) and tomato plant dataset (figure from google image).
To retrieve data: download files from summer school’s GitHub repository
dataset:
TARAoceans_proNOGS.cvsTARAoceans_proPhylo.csvmetadata:
TARAoceans_metadata.csvSamples come from eight oceans around the world (SPO: South Pacific Ocean, NAO: North Atlantic Ocean, IO: Indian Ocean, RS: Red Sea, MS: Mediterranean Sea, NPO: North Pacific Ocean, SO: Southern Ocean, SAO: South Atlantic Ocean).
Samples can come from different layers with different temperatures:
In a previous analysis (Sunagawa et al., 2015), they identified a stratification mostly driven by the temperature rather than geography or other environmental factors.
We have two types of data:
Does an integrative analysis of these two data types retrieve the stratification driver by the layers? Does it also find a geographical clustering?
Data are coming from: MiBiOmics gitlab.
To retrieve data:
dataset: using data("breast.TCGA") from the mixOmics R package
breast.TCGA$data.train$mirnabreast.TCGA$data.train$mrnabreast.TCGA$data.train$proteinmetadata: using data("breast.TCGA") from the mixOmics R package
breast.TCGA$data.train$subtypeHuman breast cancer is a heterogeneous disease. Breast tumors can be classified into several subtypes (PAM50 classification), according to the mRNA expression level (Sorlie et al., 2001). In this dataset, we have three subtypes:
We have three types of data:
Does an integrative analysis of these three data types retrieve the classification of the breast cancer? Or find another classification?
Data are coming from the mixOmics R package. The full data can be downloaded here.
To retrieve data:
dataset: using data("CLL_data") from the MOFAdata R package
CLL_data_t$DrugsCLL_data_t$MethylationCLL_data_t$mRNACLL_data_t$Mutationsmetadata: download file from summer school’s GitHub repository
sample_metadata.txtThe Chronic Lymphocytic Leukaemia (CLL) is type of blood and bone marrow cancer. The full data are explained in Dietrich et al., 2018 and available here.
We have four types of data:
To retrieve data: download files from summer school’s GitHub repository
dataset:
mrna.tsvprots.tsvmetadata:
samples_metadata.tsvIn order to study the protein turnover in developing tomato fruit (Solanum lycopersicum) in Belouah et al., two omics data types were collected:
Each data type was collected in nine different developmental stages: GR1, GR2, GR3, GR4, GR5, GR6, GR7, GR8 and GR9. For each developmental stages, we have three replicates.
Does an integrative analysis of these data types retrieve the different developmental stages?
Data are coming from Belouah et al., 2019.
The preprocessing step is the most important part of the analysis. Data need to be prepared correctly in order to extract relevant information and produce a correct and pertinent interpretation of the results.
Data preprocessing could be summarized by four main steps:
Figure 4.1: Distribution examples expected after preprocessing
The data must conform to a specific matrix shape:
HELP!
?functionName().For this tutorial, we assume that the data have been already prepared: outliers are already removed and there is no batch effect.
To load data from a file, you can use read.table() and specify the file name, the sep character and others parameters if it’s necessary.
To load data from a package, you can use data(dataName). Don’t forget to load the corresponding package before with library(packageName).
To load data from a website, you can use fread(url) from the data.table package.
The metadata contains complementary information about samples. You can load the metadata from package or file. See the section 4.2.1 about data loading.
We use mainly the metadata for visualization. So we suggest to follow these recommendations:
data.frame() or as.data.Frame() functions)read.table() use the row.names = 1 parameter)character or numerical (check using str() function)For instance, to load the ortologous gene data from Tara Ocean, the command could be:
tara_nog <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_proNOGS.csv", sep = ",", head = TRUE, row.names = 1)
tara_nog[c(1:5), c(1:5)]## NOG317682 NOG135470 NOG85325 NOG285859 NOG147792
## TARA_109_SRF 0 2.390962e-05 0 4.663604e-08 1.800215e-07
## TARA_149_MES 0 4.339824e-06 0 5.182915e-07 4.190123e-06
## TARA_110_MES 0 1.348252e-05 0 6.000043e-07 2.218342e-07
## TARA_102_MES 0 6.380711e-06 0 3.816016e-07 0.000000e+00
## TARA_142_SRF 0 9.484144e-06 0 6.437103e-08 1.132431e-06
Don’t hesitate to look the first rows of your data regularly using head(). It could be more convenient to display only the first 5 rows and columns when the data are big (tara_nog[c(1:5), c(1:5)]).
Practice:
These functions could help you: nrow(), ncol(), lapply(), dim(), t(), names(), type(), as.character() and unique().
In the SNF paper, authors recommend to filter out samples with more than 20% of missing data in a certain data type. They also recommend to filter out features with more than 20% of missing data across samples. Then, they impute the remaining missing data using K nearest neighbors (KNN) imputation.
In this tutorial, we decide to remove samples with at least one missing data. To remove samples with missing data, we propose the following NARemoving() function. Input paramaters are:
data: the data typemargin: a vector giving the subscripts which the function will be applied over (e.g. 1 indicates rows and 2 indicates columns)threshold: threshold above which samples/features are deletedNARemoving <- function(data, margin, threshold){
#' NA removing
#'
#' Calculate percentage of na
#' Remove na from rows (margin = 1) or column (margin = 2)
#'
#' @param data data.frame.
#' @param margin int. 1 = row and 2 = column
#' @param threshold int. Number of missing data accepted
#'
#' @return Return data.frame with a specific number of na by row/column
data_na <- apply(data, MARGIN = margin, FUN = function(v){sum(is.na(v)) / length(v) * 100})
# print(table(data_na))
toRemove <- split(names(data_na[data_na > threshold]), " ")[[1]]
if(margin == 1){
data_withoutNa <- data[!(row.names(data) %in% toRemove),]
print(paste0("Remove ", as.character(length(toRemove)), " samples."))
}
if(margin == 2){
data_withoutNa <- data[,!(colnames(data) %in% toRemove)]
print(paste0("Remove ", as.character(length(toRemove)), " features"))
}
return(data_withoutNa)
}For instance, the CLL drug data contain missing data (sample H024).
## D_001_1 D_001_2 D_001_3 D_001_4 D_001_5
## H045 0.02363938 0.04623274 0.3187471 0.8237027 0.8962777
## H109 0.07359900 0.10623002 0.2732891 0.7171379 0.8850003
## H024 NA NA NA NA NA
## H056 0.05813930 0.09022028 0.2322145 0.7225736 0.7957497
## H079 0.02042077 0.04750543 0.3638962 0.8073907 0.8794886
To remove samples with missing data in drug data, we use the following command line:
data = CLL_data_t$Drugs: remove the samples with missing data in the drug datamargin = 1: samples are in rows, so we want to apply the function on the rowsthreshold = 0: we remove samples with at least one missing data## [1] "Remove 16 samples."
The H024 sample is not anymore in the CLL drug data.
## D_001_1 D_001_2 D_001_3 D_001_4 D_001_5
## H045 0.02363938 0.04623274 0.3187471 0.8237027 0.8962777
## H109 0.07359900 0.10623002 0.2732891 0.7171379 0.8850003
## H056 0.05813930 0.09022028 0.2322145 0.7225736 0.7957497
## H079 0.02042077 0.04750543 0.3638962 0.8073907 0.8794886
## H164 0.02962725 0.08054628 0.4725991 0.8179143 0.8927961
We repeat this step for each data type. Then, we filter out samples that are not present in all data type.
sampleNames <- Reduce(intersect, list(rownames(CLL_drug), rownames(CLL_mrna)))
CLL_drug <- CLL_drug[rownames(CLL_drug) %in% sampleNames,]
CLL_mrna <- CLL_mrna[rownames(CLL_mrna) %in% sampleNames,]Practice:
These functions could help you: is.na(), lapply(), table() and view().
The normalization should be adapted according the data type. For the data used here, we assumed that data have been already normalized, according their type. This step is really important, and should be correctly done before every type of data integration.
Each feature (column) needs to have the mean equals to zero and the standard deviation equals to one. For that, the SNF package provides the function standardNormalization().
In the Figure 4.2, you can see the data distribution for the breast cancer miRNA data before (on the left) and after (on the right) scaling. After scaling, the data distribution should be normal.
hist(as.matrix(tcga_mirna), nclass = 100, main = "Breast cancer miRNA data", xlab = "values")
hist(as.matrix(tcga_mirna_scaled), nclass = 100, main = "Breast cancer miRNA scaled data", xlab = "values")Figure 4.2: Breast cancer miRNA data distribution before (left) and after (right) scaling.
Practice:
standardNormalization().
These functions could help you: hist(), as.matrix().
In this section, we create a sample network for each type of data, based on the similarity between samples. The main steps are (Figure 5.1):
Figure 5.1: Similarity network creation overview. Data type 1 is in the first row, represented with the green matrices. Data type 2 is in the last row, represented with the purple matrices.
The similarity network is a sample network. It is defined as a network G with nodes (or vertices) called V and connections (or edges) called E. The connections between samples are weighted. Weights come from the similarity matrix.
We can define the similarity network G like this:
First, we calculate the distance between each pair of samples for each data type using the preprocessed data. The distance method used needs to be adapted to the feature type (e.g. continuous, discrete).
In the SNF paper (Wang et al.), authors suggest to use:
In the SNF R package, the dist2() function performs a squared Euclidean distances between samples.
Practice:
These functions could help you: as.matrix(), dim(), nrow() and ncol().
The distance matrix D is then transformed into the similarity matrix W. Distances are converted to weights using the scaled exponential similarity kernel (µ) and average distances between samples and their nearest neighbors (ε):
\[ W = exp(-\frac{D^2}{µ\varepsilon})\] In other words, distances are converted to weights according the distance with the nearest neighbors of each sample pair.
The SNF R package proposes the affinityMatrix() to calculate this similarity matrix. In this case, affinity and similarity are equivalent. This function needs three parameters:
diff: distance matrix DK: number of nearest neighbors (between 10 and 30)sigma: hyperparameter or variance (between 0.3 and 0.8)The K neighbors is used to set the similarities outside of the neighborhood to zero. The sigma parameter allows scaling exponential similarity kernel, which is used to calculated the similarity.
Then, you can visualized the similarity matrix using the pheatmap() function. This function creates an heatmap of samples. By default samples are clustered using hierarchical clustering. For a better visualization, we recommend to:
show_rownames = FALSE and show_colnames = FALSEannotationlog(x, 10))Practice:
K = 20 and sigma = 0.5 for each data type.Previously, we created a similarity matrix W and its corresponding similarity network G for each data type (Figure 6.1).
Figure 6.1: In the previous step, we create a similarity matrix that contains weights for each data type. We also created the corresponding similarity network.
Now, we integrate these similarity matrices (in the Figure 6.1 there are two data types). For that, we use an iterative fusion method. The number of iteration T, needs to be defined.
First, two matrices are created from the similarity matrix of each data type:
The P matrix carries the full information about the similarity of each samples to all others.
The S matrix carries the similarity to the K most similar samples for each sample (i.e. topology). The similarities between non-neighboring nodes are set to zero because authors assume that local similarities (high weights) are more reliable than the remote ones.
Then, the P matrix of each data type is iteratively updated with information from P matrices of the other data type, making them more similar at each step. In the Figure 6.2, you have an example with two data type. It’s a bit more complex with more data type (see the paper if you are interested in).
Figure 6.2: Example of the fusion method applied to two data types. Each data type is represented using a color: data type 1 in green and data type 2 in purple.
Finally after T iterations, the P matrices of each data type are merged together to create the final fused similarity matrix, and the corresponding fused similarity network.
Then, you can visualize the fused similarity network using Cytoscape.
First, we have to define the number of iteration called T. It should be between 10 and 20 (recommended by the authors).
Then, to perform the fusion, SNF R package proposes the SNF() function. You have to provide:
Practice:
T = 10).
These functions could help you: list(), length().
You can visualize the fused similarity network using Cytoscape.
First, you need to convert the fused similarity matrix into the corresponding fused similarity network. The igraph R package allows to create and manage networks.
You can create a the fused similarity network using the graph_from_adjacency_matrix() function. This function uses a similarity matrix to create the corresponding similarity network.
We don’t want duplicate information about connections between samples, neither connections between samples themselves (self loops). But we want to keep the connection weight values. You can use these parameters:
diag = FALSEmode = "upper"weighted = TRUEThen, you can save the fused similarity network into a edge file using write.table() function.
This is an example of the saving command line:
To create a network using Cytoscape, use the following steps:
Figure 6.3: Step 1 - Import files
*_W_edgeList.txt and define:
*_metadata.txt (Import Data as Node Table Columns)Figure 6.4: Step 2 - Change the network style
In the Style tab:
select a column from the metadata. Here it’s growth_stage from the tomato datasetMapping Type tabEllipseLock node width and heightApply your favorite layout. For this step, we recommend to try at least yFiles Organic Layout.
Practice:
In Cytoscape, you see that the fused similarity network is fully connected. Indeed, each sample is connected to all other samples. Remember, connections between samples are weighted: this weight represents the similarity between samples. A high weight value means a strong similarity between two samples whereas a low weight value means a weak similarity between two samples.
For a better visualization, we can select a threshold to decide which connections to keep, based on this weight value.
How to choose the good threshold? It’s an open question. There are some possibilities:
In this example, we select a threshold based on the weight distribution of the fused similarity network.
First, we extract edges from the network object. Indeed, the network object doesn’t contains duplicate edges, neither self loops unlike the corresponding fused similarity matrix W.
For instance, we extract weight values from the Tara Ocean fused similarity network. This network contains 9591 connections in total.
Then, we display the corresponding weight histogram in the Figure 7.1 (left). You can see a large number of small weight values. We can remove values between 0 and 0.01 where the majority of the small values seem to be. With the 0.01 threshold, 777 connections are kept.
In the Figure 7.1 (right), we log-transform weights. The weight distribution is binomial and we would like to cut the distribution into two parts. With the threshold 0.001 (log(0.001, 10) = -3), 6440 connections are kept.
## Raw weights
hist(weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")
abline(v = 0.01, col = "red", lwd = 3)
## log10 weights
hist(log(weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)")
abline(v = -3.1, col = "red", lwd = 3)Figure 7.1: Weight distribution of the fused similarity network. On the left the weight distribution of the fused similarity network. On the right the log-transformed weight distribution of the fused similarity network.
| Advantages | Drawbacks |
|---|---|
| 1. Easy to implement | 1. Arbitrary |
| 2. Fast | 2. Doesn’t take account of the topology of the network |
| 3. Visual |
As example for this part, we select the median and the third quantile values as thresholds. We calculate the median and the third quantile of the weight values.
The weight distribution and the log-transformed weight distribution are displayed in the Figure 7.2, respectively left and right.
## Raw weights
hist(weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")
abline(v = W_median, col = "blue", lwd = 3)
text(W_median, 2000, pos = 4, "Median", col = "blue", cex = 1)
## log10 weights
hist(log(weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)")
abline(v = log(W_median, 10), col = "blue", lwd = 3)
text(log(W_median, 10), 400, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(W_q75, 10), col = "purple", lwd = 3)
text(log(W_q75, 10), 350, pos = 4, "quantile 75%", col = "purple", cex = 1)Figure 7.2: Weight distribution of the fused similarity network. On the left the weight distribution of the fused similarity network. On the right the log-transformed weight distribution of the fused similarity network.
The median is the value that splits data into two groups with the same number of data. With this value (0.002143), we selected 4795 connections.
The values above the third quantile value (0.0041194) are in the top of 25% highest weights. We selected 2398 connections.
| Advantages | Drawbacks |
|---|---|
| 1. Easy to implement | 1. Arbitrary (but fitted to the data) |
| 2. Fast | 2. Doesn’t take account of the topology of the network |
| 3. Fitted to the data |
In the Zahoranszky-Kohalmi et al. paper, authors propose a method to determine the best threshold according to the topology of the network. This method calculates the Average Clustering Coefficient (ACC) for a whole network. The ACC represents a global parameter that characterizes the overall network topology.
As an Elbow approach, an ACC value is calculated for a range of thresholds. The obtained values are displayed and we choose the threshold that seems the best!
For help, we created two functions to calculate these ACC values and choose the best threshold based on the topology:
CCCalculation(): this function calculates the Clustering Coefficient (CC) for each nodeACCCalculation(); this function averages the CC for a network, in order to obtain the ACC## CC calculation function
CCCalculation <- function(node, graph){
#' Clustering Coefficient (CC) calculation
#'
#' Calculate the Clustering Coefficient (CC) for each node in a network
#'
#' @param node str.
#' @param graph igraph. Network object (e.g. the fused network object)
#'
#' @return Return the corresponding CC value
degNode <- degree(graph = graph, v = node, loops = FALSE)
if(degNode > 1){
neighborNames <- neighbors(graph = graph, v = node)
graph_s <- subgraph(graph = graph, vids = neighborNames)
neighborNb <- sum(degree(graph_s, loops = FALSE))
CC <- neighborNb / (degNode * (degNode-1))
}else{CC <- 0}
return(CC)
}
## ACC calculation function
ACCCalculation <- function(graph){
#' Average Clustering Coefficient (ACC) calculation
#'
#' It average the Clustering Coefficient (CC) of a network
#'
#' @param graph igraph. Network object (e.g. the fused network object)
#'
#' @return Return the corresponding ACC value
nodes <- V(graph)
ACC <- do.call(sum, lapply(nodes, CCCalculation, graph)) / length(nodes)
return(ACC)
}First, we determine a range of thresholds. The ACC precision increases with the number of chosen thresholds (choose at least 100).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.696e-05 4.389e-04 2.143e-03 3.623e-03 4.119e-03 1.414e-01
Calculate the ACC values for each selected threshold.
ACC_W <- do.call(rbind, lapply(thresholds, function(t, net){
net_sub <- subgraph.edges(net, E(net)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub)))
return(df)
}, tara_W_net))The ACC values are displayed in the Figure 7.3 (left). The number of network edges for each threshold is displayed in the Figure 7.3 (right).
plot(x = ACC_W$thresholds, y = ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = ACC_W$thresholds[1], y = ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[21], y = ACC_W$ACC[21], col = "pink", pch = 16, cex = 1.2)
points(x = ACC_W$thresholds[29], y = ACC_W$ACC[29], col = "purple", pch = 16, cex = 1.2)
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 0.7, pos = 4, paste0("Threshold = ", ACC_W$thresholds[29]), col = "purple")
text(ACC_W$thresholds[29], 0.6, pos = 4, paste0("ACCmax = ", round(ACC_W$ACC[29], 2)), col = "purple")
plot(x = ACC_W$thresholds, y = ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = ACC_W$thresholds[29], col = "purple")
text(ACC_W$thresholds[29], 2800, pos = 4, paste0("Threshold = ", ACC_W$thresholds[29]), col = "purple")
text(ACC_W$thresholds[29], 2000, pos = 4, paste0("ACCmax = ", round(ACC_W$ACC[29], 2)), col = "purple")
text(ACC_W$thresholds[29], 1200, pos = 4, paste0("EN = ", ACC_W[29, "EN"]), col = "purple")Figure 7.3: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
On the Figure 7.3 you can see a peak that stands out in comparison with the others.
| Advantages | Drawbacks |
|---|---|
| 1. Take account of the topology of the network | 1. Might be time consuming |
| 2. Interpretation (need to be used to this method) | |
| 3. Sometimes, no peak … |
The fused similarity network is already imported in Cytoscape. You already changed the network style and the network layout. If it is not, you can go to the previous Cytoscape section.
We have a fully connected network. To select edges based on their weights, use the following steps:
Figure 7.4: Filter edge weight using the determined threshold.
In the Filter tab:
Column FilterEdge: weight column nameFigure 7.5: Create new network visualization according the determined threshold
Selected Nodes, Selected EdgesNow, it’s your turn to find the best threshold!!
Practice:
Determine the threshold:
1.1. Select a threshold based on the weight distribution.
1.2. Use the median as threshold.
1.3. Determine the threshold based on the topology of the network.
Visualize these thresholds in Cytoscape.
Determine for each network that are in Cytoscape:
3.1. Number of edges and nodes.
3.2. Number of isolated samples.
For you, which threshold is better?
For each data type:
5.1. Create a network from the similarity matrix
5.2. Do the 1-4 steps
Keep in mind that you can change this threshold in Cytoscape anytime.
You integrated your data and created the corresponding fused similarity network: CONGRATULATIONS!!
This fused similarity network can be use for downstream analysis based on network’s algorithms such as clustering, retrieval or classification. You can also visualize the network and add some external data.
In this hands-on, we give you two examples to illustrate what you can do with the fused similarity network: clustering and visualization using Cytoscape.
In the SNF paper, authors propose a clustering method called spectralClustering(). This function takes in input three parameters:
affinity: fused similarity matrix WK: number of clusters we wanttype: the variants of spectral clustering to use (default 3)Samples are clustered together according to their similarity. The number of clusters needs to be defined. Obviously, this cluster number can be chosen using an Elbow approach, which would be less arbitrary.
First, define the number of clusters. Here, we define four clusters.
Then, perform the clustering. Results are stored into the group variable.
Next, merge the clustering results with the metadata.
## Row.names ocean depth Groups
## 1 TARA_004_DCM NAO DCM 1
## 2 TARA_004_SRF NAO SRF 1
## 3 TARA_007_DCM MS DCM 1
## 4 TARA_007_SRF MS SRF 1
## 5 TARA_009_DCM MS DCM 1
## 6 TARA_009_SRF MS SRF 1
Finally, save the merged data into a file.
write.table(dataGroups, "TARAOcean_4clusters.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") Practice:
You might merge clustering results to create on result file using merge() function.
As a reminder:
You already loaded the fused similarity network in Cytoscape. And you also removed the weak edges. To visualize the clustering results follow the steps:
Figure 8.1: Import table to the Network Collection.
*_clusters.txt to the Network Collection (information will be add to every network in the collection)Scan NetworkRefresh LegendFigure 8.2: Create a legend using the Legend Panel.
This is a network visualization example for each dataset:
Figure 8.3: Example of fused similarity network visualization using Cytoscape for each dataset.
Practice:
Now, you know how to integrate different type of data using the Similarity Network Fusion (SNF) method.
You’re aware of the important steps such as the preprocessing of the data (e.g. normalization and scaling, data shape), the similarity network creation and the fusion.
You also see how to perform clustering on the fused similarity network and how to visualize results.
This part is to go further in your analysis and your interpretation of the results.
At the beginning of the hands-on, we assumed that data have been already well normalized. We also used only the euclidean distance, whatever the types of data.
It could be interesting to try different normalization and distance calculation to choose the most-adapted preprocessing of the data.
For instance, the two main adapted normalization methods for the Tara Ocean data are:
Moreover, other distance distances exist and are more appropriate for ecological data than Euclidean distance:
These functions could help you: distance(), tss() or clr(). They are available in the ecodist, hilldiv and compositions packages.
Practice:
Did you notice the edge colors in the Figure 8.3? Colors represent the contribution of each data type in each sample association.
In the SNF paper, authors did like this:
Depending of the number of the integrated data type, we suggest two workflows:
Transform similarity network into data frames (to transform similarity matrix into similarity network see section 6.3.1).
tara_W_df <- as_data_frame(tara_W_net)
tara_nog_df <- as_data_frame(tara_nog_net)
tara_phy_df <- as_data_frame(tara_phy_net)Merge data frames into one data frame:
tmp <- merge(tara_nog_df, tara_phy_df, by = c(1,2), suffixes = c("_tara_nog", "_tara_phy"))
weights_df <- merge(tmp, tara_W_df, by = c(1,2))
head(weights_df)## from to weight_tara_nog weight_tara_phy weight
## 1 TARA_004_DCM TARA_138_DCM 5.941851e-04 2.261158e-04 0.0093268270
## 2 TARA_004_SRF TARA_004_DCM 1.514222e-03 2.973412e-04 0.0151933836
## 3 TARA_004_SRF TARA_133_MES 7.162040e-06 1.011284e-05 0.0001358067
## 4 TARA_004_SRF TARA_138_DCM 8.719842e-05 1.805816e-04 0.0028421617
## 5 TARA_007_DCM TARA_004_DCM 1.163166e-05 3.450973e-05 0.0030039624
## 6 TARA_007_DCM TARA_004_SRF 1.160152e-05 3.159833e-05 0.0025836564
Determine data type contribution on each edge:
sources_df <- as.data.frame(do.call(rbind, apply(weights_df, 1, function(s){
sources <- as.numeric(s[c(3,4)])
## Initiate keep vector with FALSE value
keep <- c(rep(FALSE, length(sources)))
## Where is the max value ?
max_ind <- which.max(sources)
## Keep it
max <- max(sources)
keep[max_ind] <- TRUE
## Retrieved indices of all non max value
tmp <- seq(1, length(sources))
tmp <- tmp[!tmp %in% max_ind]
## If max weight < 10% of other weight, put it to TRUE
for(i in tmp){
w <- as.numeric(sources[i])
if(max > w+(w*10/100)){keep[i] <- FALSE}
else{keep[i] <- TRUE}
}
## Select all the TRUE weights
## Compare them by pairs
## If w1 - w2 is > 10%, they are not selected anymore
## Except if one of them is the max value
keep_df <- as.data.frame(table(keep))
if(keep_df[keep_df$keep == TRUE, "Freq"] > 1){
ind <- which(keep == TRUE)
ind_comb <- combn(x = ind, m = 2)
for(i in seq(1:ncol(ind_comb))){
vector <- ind_comb[,i]
v1 <- as.numeric(sources[vector[1]])
v2 <- as.numeric(sources[vector[2]])
if(abs(v1 - v2) < (v1*10/100) && abs(v1 - v2) < (v2*10/100)){
keep[vector[1]] <- TRUE
keep[vector[2]] <- TRUE
}else{
keep[vector[1]] <- FALSE
keep[vector[2]] <- FALSE
keep[max_ind] <- TRUE
}
}
}
names(keep) <- names(s[c(3,4)])
final <- c(s[1], s[2], keep)
return(final)
}, simplify = FALSE)))Encode data type contribution to help the visualization:
sources_df$tara_nog <- ifelse(sources_df$tara_nog == TRUE, 1, 0)
sources_df$tara_phy <- ifelse(sources_df$tara_phy == TRUE, 2, 0)
sources_df$sum <- apply(sources_df[,c(3,4)], 1, sum)Write results into a file:
tara_W_df_withSource <- (merge(W_df, sources_df, by = c(1,2)))
write.table(tara_W_df_withSource, "TaraOcean_W_edgeList_withSource.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")Counts number of each combination of data:
Finally, load this file into Cytoscape and change the edge colors (see section 6.3.2 to import files and change node color).
Transform similarity network into data frames (to transform similarity matrix into similarity network see section 6.3.1).
tcga_W_df <- as_data_frame(tcga_W_net)
tcga_mirna_df <- as_data_frame(tcga_mirna_net)
tcga_mrna_df <- as_data_frame(tcga_mrna_net)
tcga_prot_df <- as_data_frame(tcga_prot_net)Merge data frames into one data frame:
tmp1 <- merge(tcga_mirna_df, tcga_mrna_df, by = c(1,2), suffixes = c("_tcga_miRNA", "_tcga_mRNA"))
tmp2 <- merge(tcga_prot_df, tcga_W_df, by = c(1, 2), suffixes = c("_tcga_prot", ""))
weights_df <- merge(tmp1, tmp2, by = c(1,2))
head(weights_df)## from to weight_tcga_miRNA weight_tcga_mRNA weight_tcga_prot weight
## 1 A03L A08O 3.664207e-05 4.127975e-04 1.376957e-03 0.007833912
## 2 A03L A0BS 9.365460e-05 8.151721e-05 3.543471e-04 0.003016587
## 3 A03L A0E1 7.392441e-05 3.924428e-04 4.981989e-04 0.004073567
## 4 A03L A0FS 1.971148e-05 1.154763e-04 5.335656e-04 0.001935859
## 5 A03L A0H7 2.216591e-04 3.549924e-04 4.890533e-05 0.005945000
## 6 A03L A0W4 6.924449e-05 2.586121e-04 8.181917e-04 0.002147077
Determine data type contribution for each edge:
sources_df <- as.data.frame(do.call(rbind, apply(weights_df, 1, function(s){
sources <- as.numeric(s[c(3,4,5)])
## Initiate keep vector with FALSE value
keep <- c(rep(FALSE, length(sources)))
## Where is the max value ?
max_ind <- which.max(sources)
## Keep it
max <- max(sources)
keep[max_ind] <- TRUE
## Retrieved indices of all non max value
tmp <- seq(1, length(sources))
tmp <- tmp[!tmp %in% max_ind]
## If max weight < 10% of other weight, put it to TRUE
for(i in tmp){
w <- as.numeric(sources[i])
if(max > w+(w*10/100)){keep[i] <- FALSE}
else{keep[i] <- TRUE}
}
## Select all the TRUE weights
## Compare them by pairs
## If w1 - w2 is > 10%, they are not selected anymore
## Except if one of them is the max valye
keep_df <- as.data.frame(table(keep))
if(keep_df[keep_df$keep == TRUE, "Freq"] > 1){
ind <- which(keep == TRUE)
ind_comb <- combn(x = ind, m = 2)
for(i in seq(1:ncol(ind_comb))){
vector <- ind_comb[,i]
v1 <- as.numeric(sources[vector[1]])
v2 <- as.numeric(sources[vector[2]])
if(abs(v1 - v2) < (v1*10/100) && abs(v1 - v2) < (v2*10/100)){
keep[vector[1]] <- TRUE
keep[vector[2]] <- TRUE
}else{
keep[vector[1]] <- FALSE
keep[vector[2]] <- FALSE
keep[max_ind] <- TRUE
}
}
}
names(keep) <- names(s[c(3,4,5)])
final <- c(s[1], s[2], keep)
return(final)
}, simplify = FALSE)))Encode data type contribution to help the visualization:
sources_df$weight_tcga_miRNA <- ifelse(sources_df$weight_tcga_miRNA == TRUE, 1, 0)
sources_df$weight_tcga_mRNA <- ifelse(sources_df$weight_tcga_mRNA == TRUE, 2, 0)
sources_df$weight_tcga_prot <- ifelse(sources_df$weight_tcga_prot == TRUE, 4, 0)
sources_df$sum <- apply(sources_df[,c(3,4,5)], 1, sum)Write results into a file:
W_df_withSource <- (merge(tcga_W_df, sources_df, by = c(1,2)))
write.table(W_df_withSource, "TCGA_W_edgeList_withSource.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t") Counts number of each combination of data:
## Var1 Freq
## 1 1 3051
## 2 2 2115
## 3 3 280
## 4 4 5090
## 5 5 317
## 6 6 289
## 7 7 33
Finally, load this file into Cytoscape and change the edge colors (see section 6.3.2 to import files and change node color).
Practice:
The metagenomic dataset from the Tara Ocean project contains 2 data types:
Data files are available in the summer school’s GitHub repository:
First, we load the data type from TARAoceans_proNOGS.cvs and TARAoceans_proPhylo.csv files.
The data file contains header (head = TRUE) and the first column contains row names (row.names = 1). Below, the first rows and columns are displayed.
tara_nog <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_proNOGS.csv", sep = ",", head = TRUE, row.names = 1)
tara_nog[c(1:5), c(1:5)]## NOG317682 NOG135470 NOG85325 NOG285859 NOG147792
## TARA_109_SRF 0 2.390962e-05 0 4.663604e-08 1.800215e-07
## TARA_149_MES 0 4.339824e-06 0 5.182915e-07 4.190123e-06
## TARA_110_MES 0 1.348252e-05 0 6.000043e-07 2.218342e-07
## TARA_102_MES 0 6.380711e-06 0 3.816016e-07 0.000000e+00
## TARA_142_SRF 0 9.484144e-06 0 6.437103e-08 1.132431e-06
Dimensions of the data.
## [1] 139 638
The nog data contain 139 samples (rows) and 638 features (columns). Data are in the right shape: samples in rows and features in columns.
The data file contains header (head = TRUE) and the first column contains row names (row.names = 1). Below, the first rows and columns are displayed.
tara_phy <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_proPhylo.csv", sep = ",", head = TRUE, row.names = 1)
tara_phy[c(1:5), c(1:3)]## EU638706.1.1353 JN537192.1.1500 CAFJ01000195.23.1515
## TARA_109_SRF 0 0 0
## TARA_149_MES 0 0 0
## TARA_110_MES 0 4 0
## TARA_102_MES 0 1 0
## TARA_142_SRF 0 3 0
Number of rows in the data.
## [1] 139
Number of columns in the data.
## [1] 356
The phy data contain 139 samples (rows) and 356 features (columns). Data are in the right shape: samples in rows and features in columns.
The TARAoceans_metadata.csv file contains metadata. It contains header (head = TRUE) and row names (row.names = 1).
tara_metadata <- read.table(file = "../00_Data/TaraOcean_mibiomics/TARAoceans_metadata.csv", sep = ",", head = TRUE, row.names = 1)
head(tara_metadata)## ocean depth
## TARA_109_SRF SPO SRF
## TARA_149_MES NAO MES
## TARA_110_MES SPO MES
## TARA_102_MES SPO MES
## TARA_142_SRF NAO SRF
## TARA_109_DCM SPO DCM
The metadata contains two types of information: ocean and depth.
## [1] "ocean" "depth"
Samples come from 8 oceans.
## [1] "SPO" "NAO" "IO" "SO" "SAO" "NPO" "RS" "MS"
Samples were collected in 4 different depths.
## [1] "SRF" "MES" "DCM" "MIX"
The data don’t contain missing values. We can go to the following steps.
##
## FALSE
## 88682
##
## FALSE
## 49484
We assume that data have been already prepared and normalized.
Nog data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one.
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(as.matrix(tara_nog), nclass = 100, main = "Orthologous genes - Prepared data", xlab = "values")
hist(tara_nog_scaled, nclass = 100, main = "Orthologous genes - Scaled data", xlab = "values")Before scaling, data values are almost all closed to zero. After scaling, data values seem to follow something close to a normal distribution. Data values are centered to zero.
Phy data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one.
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(as.matrix(tara_phy), nclass = 100, main = "Phylogenetic profile - Prepared data", xlab = "values")
hist(tara_phy_scaled, nclass = 100, main = "Phylogenetic profile - Scaled data", xlab = "values")This kind of data are sparse: there are a lot of zero values. Majority of the values are in the first range on the histogram. After scaling, data values seem to follow something close to a normal distribution. Data values are centered to zero.
In this part, we create the similarity network for each data type.
We calculate the Euclidean distance between each pair of samples for each type of data.
tara_nog_dist <- dist2(tara_nog_scaled, tara_nog_scaled)
tara_phy_dist <- dist2(tara_phy_scaled, tara_phy_scaled)The distance matrix dimensions are 139 rows and 139 columns. Indeed, we calculated pairwise distance, so the matrix contains samples in rows and in columns.
## [1] 139 139
The diagonal is composed of zero values (or values very closed). Indeed, there is no distance between the same sample.
## TARA_109_SRF TARA_149_MES TARA_110_MES TARA_102_MES TARA_142_SRF
## TARA_109_SRF 1.705303e-13 536.8095 329.8441 5.991295e+02 646.4804
## TARA_149_MES 5.368095e+02 0.0000 261.2395 3.325547e+02 738.4168
## TARA_110_MES 3.298441e+02 261.2395 0.0000 2.078134e+02 638.5818
## TARA_102_MES 5.991295e+02 332.5547 207.8134 2.273737e-13 877.6855
## TARA_142_SRF 6.464804e+02 738.4168 638.5818 8.776855e+02 0.0000
High distance values mean that samples are not similar. And small distance values mean that samples are similar.
The distance matrix is then transformed into similarity matrix for each data type. We set two parameters:
K = 20: number of nearest neighborssigma = 0.5: hyperparameterThe affinityMatrix() function transforms the distance into similarity according the distance with the nearest neighbors.
tara_nog_W <- affinityMatrix(tara_nog_dist, K, signma)
tara_phy_W <- affinityMatrix(tara_phy_dist, K, signma)The following figures are the heatmap of the similarity matrix (W) of each data type. The left heatmap are the nog data and the right heatmap are the phy data. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities.
pheatmap(log(tara_nog_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tara_metadata, main = "Orthologous genes - log10-transformed similarity values")
pheatmap(log(tara_phy_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tara_metadata, main = "Phylogenetic profil - log10-transformed similarity values")Red color means a high similarity value between two samples whereas blue color means a small similarity value between two samples.
The two heatmaps are different. Data seem to probably carry different type of information about the samples:
We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix.
We create the fused similarity matrix using these three parameters:
K = 20: number of the nearest neighborsT = 10: number of iterations## TARA_109_SRF TARA_149_MES TARA_110_MES TARA_102_MES TARA_142_SRF
## TARA_109_SRF 5.000000e-01 9.778842e-05 0.0002535809 0.0004410605 0.0019200171
## TARA_149_MES 9.778842e-05 5.000000e-01 0.0129882372 0.0126485110 0.0004977051
## TARA_110_MES 2.535809e-04 1.298824e-02 0.5000000000 0.0266185584 0.0029147232
## TARA_102_MES 4.410605e-04 1.264851e-02 0.0266185584 0.5000000000 0.0009179439
## TARA_142_SRF 1.920017e-03 4.977051e-04 0.0029147232 0.0009179439 0.5000000000
The dimension of the fused similarity matrix are 139 rows and 139 columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them weights.
The fused similarity matrix contains 19321 weights.
## [1] 19321
The fused similarity matrix doesn’t contain zero:
## [1] 0
The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization.
pheatmap(log(tara_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tara_metadata, main = "Tara Ocean data - log10-transformed fused similarity matrix")Groups are more clearly defined in this fused similarity matrix. One corresponds to the depth MESO and other to a mix of DCM and SRF. The fused similarity matrix seems to be a mix of each similarity data type matrix.
Now, we create a fused similarity network from the fused similarity matrix. Self loops are remove (diag = FALSE) and only the upper values of the matrix are taken (mode = "upper", avoid duplicate information).
Then, the fused similarity network is saved into a the TaraOcean_W_edgeList.txt file:
write.table(as_data_frame(tara_W_net), "../02_Results/01_TaraOcean/TaraOcean_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")Figure 10.1: The fused similarity network of the Tara Ocean dataset.
According to Cytoscape, the network contains 139 samples (nodes) and 9591 connections (edges). The number of edges is smaller than in the similarity matrix because in the similarity matrix the weights are duplicates. The similarity matrix contains also the weights for each sample compare to itself (self loops).
For now, the network is fully connected: each sample is connected to every sample. Connections between samples are weights: some connections are strong (samples are similar) some other are weak (samples are not similar).
So in this section, we will choose a threshold to keep the strongest connections.
We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tara_weights <- edge.attributes(tara_W_net)$weight
hist(tara_weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")
hist(log(tara_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")
abline(v = log(0.001, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: 0.001 With this threshold, we select 6440 connections.
We calculate the median of the weights.
## [1] 0.002143026
With the mean (0.002143) as threshold, we select 4796 connections.
## [1] 4796
Calculate the third quantile of the weights:
## 75%
## 0.004119391
With the third quantile (0.0041194) as threshold, we select 2398 connections.
## [1] 2398
In the following figure, we display the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(tara_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)")
abline(v = log(tara_W_median, 10), col = "blue", lwd = 3)
text(log(tara_W_median, 10), 400, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tara_W_q75, 10), col = "purple", lwd = 3)
text(log(tara_W_q75, 10), 400, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000570 0.0004456 0.0021613 0.0071942 0.0042193 0.5000000
We set the range between 0 and 0.0005.
## [1] 201
Then, we calculate the Average Clustering Coefficient for each threshold.
tara_W_ACC <- do.call(rbind, lapply(thresholds, function(t, net){
net_sub <- subgraph.edges(net, E(net)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub)))
return(df)
}, tara_W_net))Calculated values are displayed in the following figures.
plot(x = tara_W_ACC$thresholds, y = tara_W_ACC$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = tara_W_ACC$thresholds[1], y = tara_W_ACC$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tara_W_ACC$thresholds[21], y = tara_W_ACC$ACC[21], col = "pink", pch = 16, cex = 1.2)
points(x = tara_W_ACC$thresholds[29], y = tara_W_ACC$ACC[29], col = "purple", pch = 16, cex = 1.2)
abline(v = tara_W_ACC$thresholds[29], col = "purple")
text(tara_W_ACC$thresholds[29], 0.7, pos = 4, paste0("Threshold = ", tara_W_ACC$thresholds[29]), col = "purple")
text(tara_W_ACC$thresholds[29], 0.6, pos = 4, paste0("ACCmax = ", round(tara_W_ACC$ACC[29], 2)), col = "purple")
plot(x = tara_W_ACC$thresholds, y = tara_W_ACC$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = tara_W_ACC$thresholds[29], col = "purple")
text(tara_W_ACC$thresholds[29], 2800, pos = 4, paste0("Threshold = ", tara_W_ACC$thresholds[29]), col = "purple")
text(tara_W_ACC$thresholds[29], 2000, pos = 4, paste0("ACCmax = ", round(tara_W_ACC$ACC[29], 2)), col = "purple")
text(tara_W_ACC$thresholds[29], 1200, pos = 4, paste0("EN = ", tara_W_ACC[29, "EN"]), col = "purple")Figure 10.2: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
If we selected the purple local maxima, we will have 404 edges. It could be not enough edges. Let’s see during the visualization.
## [1] 0.014
The network visualization on the left was created with the third quantile (0.0041194). The network visualization on the right was created with the ACC method (0.014).
Figure 10.3: The fused network of the Tara Ocean dataset.
The left network contains lot of edges and it’s difficult to see clear connection between samples. Nevertheless, we can see two groups of samples. It could be interesting to color the nodes with the depth information.
This trend is also shows in the right network. Moreover, ocean samples seem to be groups together.
Network visualizations are available in the TARAocean_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
tara_nog_net <- graph_from_adjacency_matrix(tara_nog_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(tara_nog_net), "../02_Results/01_TaraOcean/TaraOcean_nog_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tara_weights <- edge.attributes(tara_nog_net)$weight
hist(tara_weights, nclass = 100, main = "nog similarity network weight distribution", xlab = "weights")
hist(log(tara_weights, 10), nclass = 100, main = "nog similarity network weight distribution", xlab = "weights")
abline(v = log(1.584893e-05, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a kind of normal distribution. We would probably like to cut in the middle of the peak, or just before or after. If we cut in the middle, the corresponding weight is: 1.584893e-05. With this threshold, we select 6666 connections.
Calculate the median.
## [1] 2.876783e-05
Number of selected edges with the median as threshold.
## [1] 4796
Calculate the third quantile.
## 75%
## 8.505716e-05
Number of selected edges with the third quantile as threshold:
## [1] 2398
The following figures show where are these two threshold in the weight distribution.
hist(log(tara_weights, 10), nclass = 100, main = "nog weight distribution", xlab = "log10(weights)")
abline(v = log(tara_nog_median, 10), col = "blue", lwd = 3)
text(log(tara_nog_median, 10), 350, pos = 4, "Median", col = "blue", cex = 1)
abline(v = log(tara_nog_q75, 10), col = "purple", lwd = 3)
text(log(tara_nog_q75, 10), 250, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine the range of the threshold, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.822e-06 1.339e-05 2.877e-05 1.190e-04 8.506e-05 1.208e-02
We define the threshold range to try.
## [1] 161
Then, we calculate the Average Clustering Coefficient for each threshold.
tara_nog_ACC <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, tara_nog_net))## ACC
plot(x = tara_nog_ACC$thresholds, y = tara_nog_ACC$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC of orthologous gene data", type = "o")
points(x = tara_nog_ACC$thresholds[1], y = tara_nog_ACC$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tara_nog_ACC$thresholds[9], y = tara_nog_ACC$ACC[9], col = "pink", pch = 16, cex = 1.2)
points(x = tara_nog_ACC$thresholds[10], y = tara_nog_ACC$ACC[10], col = "purple", pch = 16, cex = 1.2)
abline(v = tara_nog_ACC$thresholds[10], col = "purple")
text(tara_nog_ACC$thresholds[10], 0.8, pos = 4, paste0("Threshold = ", tara_nog_ACC$thresholds[10]), col = "purple")
text(tara_nog_ACC$thresholds[10], 0.7, pos = 4, paste0("ACCmax = ", round(tara_nog_ACC$ACC[10], 2)), col = "purple")
## EN
plot(x = tara_nog_ACC$thresholds, y = tara_nog_ACC$EN, xlab = "thresholds", ylab = "number of edges", main = "Edge number of orthologous genes data", type = "o")
abline(v = tara_nog_ACC$thresholds[10], col = "purple")
text(tara_nog_ACC$thresholds[10], 2800, pos = 4, paste0("Threshold = ", tara_nog_ACC$thresholds[10]), col = "purple")
text(tara_nog_ACC$thresholds[10], 2000, pos = 4, paste0("ACCmax = ", round(tara_nog_ACC$ACC[10], 2)), col = "purple")
text(tara_nog_ACC$thresholds[10], 1200, pos = 4, paste0("EN = ", tara_nog_ACC[10, "EN"]), col = "purple")Figure 10.4: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
The local maxima threshold is:
## [1] 0.00045
And the number of selected egdes are:
## [1] 472
Network visualizations are available in the TARAocean_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
tara_phy_net <- graph_from_adjacency_matrix(tara_phy_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(tara_phy_net), "../02_Results/01_TaraOcean/TaraOcean_phy_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tara_weights <- edge.attributes(tara_phy_net)$weight
hist(tara_phy_W, nclass = 100, main = "phy similarity network weight distribution", xlab = "weights")
hist(log(tara_weights, 10), nclass = 100, main = "phy similarity network weight distribution", xlab = "log10(weights)")
abline(v = log(3.162278e-05, 10), col = "cyan", lwd = 3)Number of connections:
## [1] 5188
Calculate the median.
## [1] 3.538224e-05
Number of selected edges with the median as threshold.
## [1] 4796
Calculate the third quantile.
## 75%
## 7.762902e-05
Number of selected edges with the third quantile as threshold.
## [1] 2398
The following figure show where are these two threshold in the weight distribution.
hist(log(tara_weights, 10), nclass = 100, main = "phy similarity network weight distribution", xlab = "log10(weights)")
abline(v = log(tara_phy_median, 10), col = "blue", lwd = 3)
text(log(tara_phy_median, 10), 370, pos = 4, "Median", col = "blue", cex = 1)
abline(v = log(tara_phy_q75, 10), col = "purple", lwd = 3)
text(log(tara_phy_q75, 10), 250, pos = 4, "quantile 75%", col = "purple", cex = 1)We define the threshold range to try.
## [1] 161
tara_phy_ACC <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, tara_phy_net))## ACC
plot(x = tara_phy_ACC$thresholds, y = tara_phy_ACC$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC of phylogenetic profil data", type = "o")
points(x = tara_phy_ACC$thresholds[1], y = tara_phy_ACC$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tara_phy_ACC$thresholds[17], y = tara_phy_ACC$ACC[17], col = "pink", pch = 16, cex = 1.2)
points(x = tara_phy_ACC$thresholds[22], y = tara_phy_ACC$ACC[22], col = "purple", pch = 16, cex = 1.2)
abline(v = tara_phy_ACC$thresholds[22], col = "purple")
text(tara_phy_ACC$thresholds[22], 0.9, pos = 4, paste0("Threshold = ", tara_phy_ACC$thresholds[22]), col = "purple")
text(tara_phy_ACC$thresholds[22], 0.85, pos = 4, paste0("ACCmax = ", round(tara_phy_ACC$ACC[22], 2)), col = "purple")
## EN
plot(x = tara_phy_ACC$thresholds, y = tara_phy_ACC$EN, xlab = "thresholds", ylab = "number of edges", main = "Edge number of phylogenetic profil data", type = "o")
abline(v = tara_phy_ACC$thresholds[22], col = "purple")
text(tara_phy_ACC$thresholds[22], 2800, pos = 4, paste0("Threshold = ", tara_phy_ACC$thresholds[22]), col = "purple")
text(tara_phy_ACC$thresholds[22], 2000, pos = 4, paste0("ACCmax = ", round(tara_phy_ACC$ACC[22], 2)), col = "purple")
text(tara_phy_ACC$thresholds[22], 1200, pos = 4, paste0("EN = ", tara_phy_ACC[22, "EN"]), col = "purple")Figure 10.5: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
The local maxima threshold is:
## [1] 0.00105
And the number of selected egdes are:
## [1] 120
Network visualizations are available in the TARAocean_cytoscape.cys file.
Samples are clustered together according to their similarity. According to our data and the information we have, we choose 4 and 8 clusters. Indeed, data are coming from four different depths and eight different oceans.
Results are saved into the same file.
These are two examples of network visualization for the Tara ocean dataset.
Figure 10.6: Left network: edge weights > 0.014 and node color according ocean. Right network: edge weights > 0.014 and node color according the depth.
We can see a high connected subnetwork on the left, connected to a sparser subnetwork on the right. The highly connected subnetwork corresponds to the MES samples (deeper layer) and the other subnetwork to DCM and SRF (surface layers).
Samples from the same ocean seem to be grouped together. We can’t see this stratification if we analysis one type of data alone.
The breast cancer dataset from The Cancer Genome Atlas (TCGA) contains 3 data types:
Data are available in the R package mixOmics. The metadata are also available in this package.
We load the breast cancer dataset:
The breast.TCGA object contains 3 types of data and one metadata:
## [1] "mirna" "mrna" "protein" "subtype"
Dimensions of the data are different:
## $mirna
## [1] 150 184
##
## $mrna
## [1] 150 200
##
## $protein
## [1] 150 142
##
## $subtype
## NULL
Data are extracted into single data frame:
tcga_mirna = breast.TCGA$data.train$mirna
tcga_mrna = breast.TCGA$data.train$mrna
tcga_prot = breast.TCGA$data.train$proteintcga_miRNA data contain 150 samples in rows and 184 features in columns.tcga_mRNA data contain 150 samples in rows and 200 features in columns.tcga_prot data contain 150 samples in rows and 142 features in columns.Data are already well shaped.
## RTN2 NDRG2 CCDC113 FAM63A ACADS
## A0FJ 4.362183 7.533461 3.956124 4.457170 2.256817
## A13E 1.984492 7.455194 5.427623 5.440957 4.028813
## A0G0 1.727323 8.079968 2.227300 5.543480 2.629855
## A0SX 4.363996 5.793750 3.544866 4.737114 4.269101
## A143 2.447562 7.158993 4.691256 4.808728 2.442135
We extract metadat from the breast.TCGA$data.train object.
The metadata contain the subtype of the breast cancer for each sample.
## [1] Basal Basal Basal Basal Basal Basal
## Levels: Basal Her2 LumA
For each subtype, there are 45, 30 and 75 samples:
## Basal Her2 LumA
## 45 30 75
Metadata should be stored in a data frame:
tcga_metadata_df <- data.frame("subtype" = tcga_metadata)
row.names(tcga_metadata_df) <- row.names(tcga_mirna)We save the metadata into a file. This file will be useful for the visualization.
Data don’t contain missing value. We can go to the following steps.
##
## FALSE
## 27600
##
## FALSE
## 30000
##
## FALSE
## 21300
We assume that data have been already prepared and normalized.
miRNA data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one.
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(tcga_mirna, nclass = 100, main = "TCGA miRNA data - Data distribution before scaling", xlab = "values")
hist(tcga_mirna_scaled, nclass = 100, main = "TCGA miRNA data - Data distribution after scaling", xlab = "scaled values")After scaling, data values seem to follow a normal distribution. Data values are centered to zero.
mrna data are scaled:
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(tcga_mrna, nclass = 100, main = "TCGA mRNA data - Data distribution before scaling", xlab = "values")
hist(tcga_mrna_scaled, nclass = 100, main = "TCGA mRNA data - Data distribution after scaling", xlab = "scaled values")After scaling, data values seem to follow a normal distribution. Data values are centered to zero.
Protein data are scaled:
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(tcga_prot, nclass = 100, main = "TCGA proteomic data - Data distribution before scaling", xlab = "values")
hist(tcga_prot_scaled, nclass = 100, main = "TCGA proteomic data - Data distribution after scaling", xlab = "scaled values")The protein data seem to be already scaled. So for the following steps, we will used tcga_prot variable.
In this part, we create the similarity network for each data type.
We calculate the Euclidean distance between each pair of samples for each type of data.
tcga_mirna_dist <- dist2(tcga_mirna_scaled, tcga_mirna_scaled)
tcga_mrna_dist <- dist2(tcga_mrna_scaled, tcga_mrna_scaled)
tcga_prot_dist <- dist2(tcga_prot, tcga_prot)Distance matrices have 150 rows and 150 columns. We calculated pairwise distance, so the matrix has samples in rows and in columns.
## [1] 150 150
The diagonal of the distance matrix contains the distance between sample and itself. So the distance is equal (or very close) to zero.
## A0FJ A13E A0G0 A0SX A143
## A0FJ 0.0000 3.150041e+02 271.9832 203.1437 513.4011
## A13E 315.0041 5.684342e-14 391.9542 291.6305 421.5542
## A0G0 271.9832 3.919542e+02 0.0000 243.1119 344.3791
## A0SX 203.1437 2.916305e+02 243.1119 0.0000 413.7339
## A143 513.4011 4.215542e+02 344.3791 413.7339 0.0000
High distance values mean that samples are not similar. And small distance values mean that samples are similar.
The distance matrix is then transformed into similarity matrix for each data type. We set two parameters:
K = 20: number of nearest neighborssignma = 0.5: hyperparameterThe affinityMatrix() function transforms the distance into similarity according the distance with the nearest neighbors.
tcga_mirna_W <- affinityMatrix(tcga_mirna_dist, K, sigma)
tcga_mrna_W <- affinityMatrix(tcga_mrna_dist, K, sigma)
tcga_prot_W <- affinityMatrix(tcga_prot_dist, K, sigma)The following figures are the heatmap of the similarity matrix (W) of each data type. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities.
pheatmap(log(tcga_mirna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA miRNA data")
pheatmap(log(tcga_mrna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA mRNA data")
pheatmap(log(tcga_prot_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA proteomic data")Figure 11.1: TCGA dataset - log-transformed similarity matrix heatmap
Red color means a high similarity value between two samples whereas blue color means a small similarity value between two samples.
Heatmaps are different between data types.
We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix.
We create the fused similarity matrix using these three parameters:
K = 20: number of nearest neighborsT = 10: number of iterationsK = 20
T = 10
tcga_W <- SNF(list(tcga_mirna_W, tcga_mrna_W, tcga_prot_W), K, T)
tcga_W[c(1:5), c(1:5)]## A0FJ A13E A0G0 A0SX A143
## A0FJ 0.500000000 0.014525731 0.015552881 0.013856749 0.006897882
## A13E 0.014525731 0.500000000 0.015887769 0.008857892 0.006463031
## A0G0 0.015552881 0.015887769 0.500000000 0.003744441 0.011143993
## A0SX 0.013856749 0.008857892 0.003744441 0.500000000 0.003083092
## A143 0.006897882 0.006463031 0.011143993 0.003083092 0.500000000
The dimensions of the fused network are 150 rows and 150 columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them weights.
The fused similarity matrix contains 22500 weights.
## [1] 22500
The fused similarity matrix doesn’t contain zero.
##
## FALSE
## 22500
The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization.
pheatmap(log(tcga_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tcga_metadata_df, main = "TCGA - Fused similarity matrix W")Read color means a high similarity between samples. Blue color means a small similarity between samples.
In this heatmap, samples seem to be well clustered, according the cancer subtype. Basal samples are very different from LumA samples.
Now, we create a fused similarity network from the fused similarity matrix. Self loops are remove (diag = FALSE) and only the upper values of the matrix are taken (mode = "upper", avoid duplicate information).
Then, the fused similarity network is saved into a the TCGA_W_edgeList.txt file:
write.table(as_data_frame(tcga_W_net), "../02_Results/03_BreastTCGA/TCGA_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")Figure 11.2: Fused similarity network of the breast cancer dataset. Visualization using Cytoscape.
According Cytoscape, the fused similarity network contains 150 nodes (samples) and 11175 edges (connections) between samples. The connections number is smaller in Cytoscape. Indeed, in the similarity matrix weights are duplicates. The similarity matrix contains also the weights for each sample compare to itself (self loops).
For now, the fused similarity network is fully connected: each sample is connected to every other samples. Connections between samples are weighted: some connections are strong (samples are similar) and some other are weak (samples are not similar).
In this section, we will determine a threshold to select the strongest connections between samples.
We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tcga_weights <- edge.attributes(tcga_W_net)$weight
hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(0.0039, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: 0.0039. With this threshold, we select 3198 connections.
Calculate the median of the weights:
## [1] 0.001811571
With the mean (0.0018116) as threshold, we select 5588 connections.
## [1] 5588
Calculate the third quantile of the weights:
## 75%
## 0.004639675
With the third quantile (0.0046397) as threshold, we select 2794 connections.
## [1] 2794
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(tcga_W_median, 10), col = "blue", lwd = 3)
text(log(tcga_W_median, 10), 160, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tcga_W_q75, 10), col = "purple", lwd = 3)
text(log(tcga_W_q75, 10), 160, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001333 0.0006946 0.0018320 0.0066667 0.0047588 0.5000000
We define a vector of threshold range to try (at least 100 values).
## [1] 101
Then, we calculate the Average Clustering Coefficient for each threshold.
tcga_ACC_W <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, tcga_W_net))Calculated values are displayed in the following figures.
## ACC
plot(x = tcga_ACC_W$thresholds, y = tcga_ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = tcga_ACC_W$thresholds[1], y = tcga_ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tcga_ACC_W$thresholds[55], y = tcga_ACC_W$ACC[55], col = "pink", pch = 16, cex = 1.2)
points(x = tcga_ACC_W$thresholds[58], y = tcga_ACC_W$ACC[58], col = "purple", pch = 16, cex = 1.2)
text(tcga_ACC_W$thresholds[58], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_W$thresholds[58]), col = "purple")
text(tcga_ACC_W$thresholds[58], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_W$ACC[58]), col = "purple")
## EN
plot(x = tcga_ACC_W$thresholds, y = tcga_ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = tcga_ACC_W$thresholds[58], col = "purple")
text(tcga_ACC_W$thresholds[58], 6000, pos = 4, paste0("Threshold = ", tcga_ACC_W$thresholds[58]), col = "purple")
text(tcga_ACC_W$thresholds[58], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_W$ACC[58]), col = "purple")
text(tcga_ACC_W$thresholds[58], 4000, pos = 4, paste0("EN = ", tcga_ACC_W[58, "EN"]), col = "purple")Figure 11.3: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
We don’t have obvious and clear local maxima with this dataset.
If we selected the purple local maxima, we will have 598 edges. It could be not enough edges. Let’s see during the visualization.
## [1] 0.0114
The network visualization on the left was created with the third quantile (0.0046397). The network visualization on the right was created with the ACC method (0.0114).
The left network contains lot of edges and it’s difficult to see clear connection between samples. Nevertheless, we can see that LumA samples are not connected (very few edges) to the Basal samples.
This trend is also shows in the right network. Samples from the same cancer subtypes are connected together.
The two representations could be interesting. Network visualizations are available in the TCGA_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
tcga_mirna_net <- graph_from_adjacency_matrix(tcga_mirna_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(tcga_mirna_net), "../02_Results/03_BreastTCGA/TCGA_miRNA_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tcga_weights <- edge.attributes(tcga_mirna_net)$weight
hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(0.0001, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a kind of normal distribution. We would probably like to cut in the middle of the peak, or just before or after. If we cut in the middle, the corresponding weight is: 0.0001. With this threshold, we select 5011 connections.
Calculate the median of the weights:
## [1] 8.421742e-05
Number of selected edges with the median as threshold.
## [1] 5588
Calculate the third quantile of the weights:
## 75%
## 0.0001979659
Number of selected edges with the third quantile as threshold:
## [1] 2794
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(tcga_mirna_median, 10), col = "blue", lwd = 3)
text(log(tcga_mirna_median, 10), 400, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tcga_mirna_q75, 10), col = "purple", lwd = 3)
text(log(tcga_mirna_q75, 10), 400, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.300e-07 3.072e-05 8.514e-05 2.002e-04 2.020e-04 1.077e-02
We define a vector of threshold range to try (at least 100 values).
## [1] 101
Then, we calculate the Average Clustering Coefficient for each threshold.
tcga_ACC_mirna <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, tcga_mirna_net))Calculated values are displayed in the following figures.
## ACC
plot(x = tcga_ACC_mirna$thresholds, y = tcga_ACC_mirna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = tcga_ACC_mirna$thresholds[1], y = tcga_ACC_mirna$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tcga_ACC_mirna$thresholds[38], y = tcga_ACC_mirna$ACC[38], col = "pink", pch = 16, cex = 1.2)
points(x = tcga_ACC_mirna$thresholds[41], y = tcga_ACC_mirna$ACC[41], col = "purple", pch = 16, cex = 1.2)
text(tcga_ACC_mirna$thresholds[41], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_mirna$thresholds[41]), col = "purple")
text(tcga_ACC_mirna$thresholds[41], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_mirna$ACC[41]), col = "purple")
## EN
plot(x = tcga_ACC_mirna$thresholds, y = tcga_ACC_mirna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = tcga_ACC_mirna$thresholds[41], col = "purple")
text(tcga_ACC_mirna$thresholds[41], 6000, pos = 4, paste0("Threshold = ", tcga_ACC_mirna$thresholds[41]), col = "purple")
text(tcga_ACC_mirna$thresholds[41], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_mirna$ACC[41]), col = "purple")
text(tcga_ACC_mirna$thresholds[41], 4500, pos = 4, paste0("EN = ", tcga_ACC_mirna[41, "EN"]), col = "purple")Figure 11.4: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
The local maxima threshold is:
## [1] 8e-04
And the number of selected egdes are:
## [1] 249
Network visualizations are available in the TCGA_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
tcga_mrna_net <- graph_from_adjacency_matrix(tcga_mrna_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(tcga_mrna_net), "../02_Results/03_BreastTCGA/TCGA_mRNA_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tcga_weights <- edge.attributes(tcga_mrna_net)$weight
hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(0.0001, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: 0.0001. With this threshold, we select 4702 connections.
Calculate the median of the weights:
## [1] 7.506899e-05
Number of selected edges with the median as threshold.
## [1] 5588
Calculate the third quantile of the weights:
## 75%
## 0.0001834747
Number of selected edges with the third quantile as threshold.
## [1] 2794
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(tcga_mrna_median, 10), col = "blue", lwd = 3)
text(log(tcga_mrna_median, 10) - 0.2, 380, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tcga_mrna_q75, 10), col = "purple", lwd = 3)
text(log(tcga_mrna_q75, 10), 380, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007847 4.002122 5.172385 5.255571 6.389464 12.980562
We define a vector of threshold range to try (at least 100 values).
## [1] 126
Then, we calculate the Average Clustering Coefficient for each threshold.
tcga_ACC_mrna <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, tcga_mrna_net))Calculated values are displayed in the following figures.
## ACC
plot(x = tcga_ACC_mrna$thresholds, y = tcga_ACC_mrna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = tcga_ACC_mrna$thresholds[1], y = tcga_ACC_mrna$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tcga_ACC_mrna$thresholds[54], y = tcga_ACC_mrna$ACC[54], col = "pink", pch = 16, cex = 1.2)
points(x = tcga_ACC_mrna$thresholds[56], y = tcga_ACC_mrna$ACC[56], col = "purple", pch = 16, cex = 1.2)
text(tcga_ACC_mrna$thresholds[56], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_mrna$thresholds[56]), col = "purple")
text(tcga_ACC_mrna$thresholds[56], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_mrna$ACC[56]), col = "purple")
## EN
plot(x = tcga_ACC_mrna$thresholds, y = tcga_ACC_mrna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = tcga_ACC_mrna$thresholds[56], col = "purple")
text(tcga_ACC_mrna$thresholds[56], 6000, pos = 4, paste0("Threshold = ", tcga_ACC_mrna$thresholds[56]), col = "purple")
text(tcga_ACC_mrna$thresholds[56], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_mrna$ACC[56]), col = "purple")
text(tcga_ACC_mrna$thresholds[56], 4500, pos = 4, paste0("EN = ", tcga_ACC_mrna[56, "EN"]), col = "purple")Figure 11.5: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
We don’t have obvious and clear local maxima with this dataset.
The local maxima threshold is:
## [1] 0.0011
And the number of selected egdes are:
## [1] 68
Network visualizations are available in the TCGA_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
tcga_prot_net <- graph_from_adjacency_matrix(tcga_prot_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(tcga_prot_net), "../02_Results/03_BreastTCGA/TCGA_prot_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tcga_weights <- edge.attributes(tcga_prot_net)$weight
hist(tcga_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(7.943282e-05, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: 7.943282e-05. With this threshold, we select 6814 connections.
Calculate the median of the weights:
## [1] 0.0001414766
Number of selected edges with the median as threshold.
## [1] 5588
Calculate the third quantile of the weights:
## 75%
## 0.0005063203
Number of selected edges with the third quantile as threshold.
## [1] 2794
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(tcga_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(tcga_prot_median, 10), col = "blue", lwd = 3)
text(log(tcga_prot_median, 10), 280, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tcga_prot_q75, 10), col = "purple", lwd = 3)
text(log(tcga_prot_q75, 10), 260, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.98579 -0.22688 0.00000 0.03095 0.26711 6.63490
We define a vector of threshold range to try (at least 100 values).
## [1] 101
Then, we calculate the Average Clustering Coefficient for each threshold.
tcga_ACC_prot <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, tcga_prot_net))Calculated values are displayed in the following figures.
## ACC
plot(x = tcga_ACC_prot$thresholds, y = tcga_ACC_prot$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = tcga_ACC_prot$thresholds[1], y = tcga_ACC_prot$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tcga_ACC_prot$thresholds[36], y = tcga_ACC_prot$ACC[36], col = "pink", pch = 16, cex = 1.2)
points(x = tcga_ACC_prot$thresholds[39], y = tcga_ACC_prot$ACC[39], col = "purple", pch = 16, cex = 1.2)
text(tcga_ACC_prot$thresholds[39], 0.5, pos = 4, paste0("Threshold = ", tcga_ACC_prot$thresholds[39]), col = "purple")
text(tcga_ACC_prot$thresholds[39], 0.4, pos = 4, paste0("ACCmax = ", tcga_ACC_prot$ACC[39]), col = "purple")
## EN
plot(x = tcga_ACC_prot$thresholds, y = tcga_ACC_prot$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = tcga_ACC_prot$thresholds[39], col = "purple")
text(tcga_ACC_prot$thresholds[39], 6000, pos = 4, paste0("ACCmax = ", tcga_ACC_prot$thresholds[39]), col = "purple")
text(tcga_ACC_prot$thresholds[39], 5000, pos = 4, paste0("ACCmax = ", tcga_ACC_prot$ACC[39]), col = "purple")
text(tcga_ACC_prot$thresholds[39], 4500, pos = 4, paste0("EN = ", tcga_ACC_prot[39, "EN"]), col = "purple")Figure 11.6: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
The local maxima threshold is:
## [1] 0.0038
And the number of selected egdes are:
## [1] 186
Network visualizations are available in the TCGA_cytoscape.cys file.
Samples are clustered together according to their similarity.We know that there are three breast cancer subtypes in the dataset. So we decide to perform a clustering with three and four clusters.
Results are saved into the same file.
These are two examples of network visualization for the breast cancer dataset.
Figure 11.7: Left network: edge weights > 0.011 and node color according cancer subtypes. Right network: edge weights > 0.011 and node color according clustering results.
three clusters.We can see three interconnected groups. These groups are consistent with the cancer subtypes. We can also assign one cancer subtype per clusters, found by SNF.
Protein data support a lot of edges in this network. And miRNA and mRNA data seem to capture same kind of information (light green edges).
The Chronic Lymphocytic Leukaemia (CLL) dataset contains 4 data types:
Data are available in the R package MOFAdata. Metadata file is available in the summer school’s GitHub repository.
The CLL data are available in the MOFAdata R package.
The CLL_data object contains 4 types of data with different dimensions:
## $Drugs
## [1] 310 200
##
## $Methylation
## [1] 4248 200
##
## $mRNA
## [1] 5000 200
##
## $Mutations
## [1] 69 200
Rows are features (e.g. drug, genes) and columns are samples. There are 200 samples. We have to change the shape of the data.
Now, samples are in rows:
## ENSG00000244734 ENSG00000158528 ENSG00000198478 ENSG00000175445
## H045 4.558644 11.741854 8.921456 12.686458
## H109 2.721512 13.287432 2.721512 10.925985
## H024 9.938456 2.341006 12.381452 1.528848
## H056 13.278004 3.232874 8.106266 1.528848
## H079 6.086874 11.940820 4.889503 13.340588
## ENSG00000174469
## H045 2.644946
## H109 12.648355
## H024 1.528848
## H056 13.565210
## H079 5.476914
The sample_metadata.txt file contains metadata. It contains header (head = TRUE) and row names (row.names = 1).
CLL_metadata <- read.table("../00_Data/CLL/sample_metadata.txt", head = TRUE, sep = "\t", row.names = 1)
head(CLL_metadata)## Gender age TTT TTD treatedAfter died IGHV trisomy12
## H005 m 75.26575 0.57494867 2.625599 TRUE FALSE 1 0
## H006 m NA NA NA NA NA NA NA
## H007 f NA NA NA NA NA NA NA
## H008 m NA NA NA NA NA NA NA
## H010 f 72.78082 2.93223819 2.932238 FALSE FALSE 0 0
## H011 f 72.99452 0.01916496 2.951403 TRUE FALSE 1 0
We have information for each sample about:
## [1] "Gender" "age" "TTT" "TTD" "treatedAfter"
## [6] "died" "IGHV" "trisomy12"
For visualization, columns should be numerical, logical or character.
## 'data.frame': 200 obs. of 8 variables:
## $ Gender : chr "m" "m" "f" "m" ...
## $ age : num 75.3 NA NA NA 72.8 ...
## $ TTT : num 0.575 NA NA NA 2.932 ...
## $ TTD : num 2.63 NA NA NA 2.93 ...
## $ treatedAfter: logi TRUE NA NA NA FALSE TRUE ...
## $ died : logi FALSE NA NA NA FALSE FALSE ...
## $ IGHV : int 1 NA NA NA 0 1 0 0 0 0 ...
## $ trisomy12 : int 0 NA NA NA 0 0 0 0 0 1 ...
Overview of the drug data:
## D_001_1 D_001_2 D_001_3 D_001_4 D_001_5
## H045 0.02363938 0.04623274 0.3187471 0.8237027 0.8962777
## H109 0.07359900 0.10623002 0.2732891 0.7171379 0.8850003
## H024 NA NA NA NA NA
## H056 0.05813930 0.09022028 0.2322145 0.7225736 0.7957497
## H079 0.02042077 0.04750543 0.3638962 0.8073907 0.8794886
The CLL drug data contains missing data.
##
## FALSE TRUE
## 57040 4960
Overview of the methylation data:
## cg10146935 cg26837773 cg17801765 cg13244315 cg06181703
## H045 1.81108585 -5.1725723 5.4115263 -0.1188251 5.1203838
## H109 -3.99750846 1.5948702 5.4126925 1.0438706 1.2794803
## H024 -2.84431298 0.1611705 0.3657059 -4.2192362 0.7211004
## H056 -3.33865611 -2.0934326 0.3736342 -1.5921965 4.0470594
## H079 -0.01936203 3.7489796 5.4120096 1.4164183 5.2374225
The CLL methylation data contains missing data.
##
## FALSE TRUE
## 832608 16992
Overview of the mRNA data:
## ENSG00000244734 ENSG00000158528 ENSG00000198478 ENSG00000175445
## H045 4.558644 11.741854 8.921456 12.686458
## H109 2.721512 13.287432 2.721512 10.925985
## H024 9.938456 2.341006 12.381452 1.528848
## H056 13.278004 3.232874 8.106266 1.528848
## H079 6.086874 11.940820 4.889503 13.340588
## ENSG00000174469
## H045 2.644946
## H109 12.648355
## H024 1.528848
## H056 13.565210
## H079 5.476914
The CLL mRNA data contains missing data.
##
## FALSE TRUE
## 680000 320000
Overview of the mutation data:
## gain2p25.3 gain3q26 del6p21.2 del6q21 del8p12
## H045 0 0 0 0 0
## H109 0 0 0 0 0
## H024 0 0 0 0 0
## H056 0 0 0 0 0
## H079 1 0 0 0 0
The CLL mutation data contains missing data.
##
## FALSE TRUE
## 9141 4659
We remove samples with at least one missing data in each data type using the NARemoving() function. We set:
margin = 1 because samples are in rowthreshold = 0 because we don’t want missing data at all## [1] "Remove 16 samples."
## [1] "Remove 4 samples."
## [1] "Remove 64 samples."
## [1] "Remove 192 samples."
We decide to not use the mutation data because this data contains a lot of missing data. Data types need to have the same set of samples.
sampleNames <- Reduce(intersect, list(rownames(CLL_drug), rownames(CLL_meth), rownames(CLL_mrna)))
CLL_drug <- CLL_drug[rownames(CLL_drug) %in% sampleNames,]
CLL_meth <- CLL_meth[rownames(CLL_meth) %in% sampleNames,]
CLL_mrna <- CLL_mrna[rownames(CLL_mrna) %in% sampleNames,]
lapply(list("Drugs" = CLL_drug, "Meth" = CLL_meth, "mRNA" = CLL_mrna), dim)## $Drugs
## [1] 121 310
##
## $Meth
## [1] 121 4248
##
## $mRNA
## [1] 121 5000
We will run SNF with 121 samples and three different data types.
We assume that data have been already prepared and normalized.
Drug data are scaled. Each column will have the mean equals to zero and the standard deviation equals to one.
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(CLL_drug, nclass = 100, main = "CLL drug data - Data distribution before scaling", xlab = "values")
hist(CLL_drug_scaled, nclass = 100, main = "CLL drug data - Data distribution after scaling", xlab = "scaled values")After scaling, drug data follow a normal distribution.
Methylation data are scaled.
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(CLL_meth, nclass = 100, main = "CLL methylation data - Data distribution before scaling", xlab = "values")
hist(CLL_meth_scaled, nclass = 100, main = "CLL methylation data - Data distribution after scaling", xlab = "scaled values")Here, we can see more a binomial distribution after scaling. But, the data are centered.
mRNA data are scaled.
The following figures are the distribution of the data, before (left) and after (right) scaling. We expected a normal distribution of the data after scaling.
hist(CLL_mrna, nclass = 100, main = "CLL mRNA data - Data distribution before scaling", xlab = "values")
hist(CLL_mrna_scaled, nclass = 100, main = "CLL mRNA data - Data distribution after scaling", xlab = "scaled values")After scaling, mRNA data follow a normal distribution.
In this part, we create the similarity network for each data type.
We calculate the Euclidean distance between each pair of samples for each type of data.
CLL_drug_dist <- dist2(CLL_drug_scaled, CLL_drug_scaled)
CLL_meth_dist <- dist2(CLL_meth_scaled, CLL_meth_scaled)
CLL_mrna_dist <- dist2(CLL_mrna_scaled, CLL_mrna_scaled)Distance matrices have 121 rows (samples) and 121 columns (samples). We calculated pairwise distance, so the matrix has samples in rows and in columns.
## [1] 121 121
The diagonal of the distance matrix contains the distance between sample and itself. So the distance is equal (or very close) to zero.
## H045 H109 H056 H079 H164
## H045 2.273737e-13 4.028353e+02 1.115350e+03 340.5212 554.7438
## H109 4.028353e+02 2.273737e-13 1.074784e+03 671.3817 608.1489
## H056 1.115350e+03 1.074784e+03 1.136868e-13 729.6333 625.2601
## H079 3.405212e+02 6.713817e+02 7.296333e+02 0.0000 481.3647
## H164 5.547438e+02 6.081489e+02 6.252601e+02 481.3647 0.0000
High distance values mean that samples are not similar. And small distance values mean that samples are similar.
The distance values are transformed according the neighbors of the samples. We set two parameters:
K = 20: number of nearest neighborssigma = 0.5: hyperparameterThe affinityMatrix() function transforms the distance into similarity according the distance with the nearest neighbors.
CLL_drug_W <- affinityMatrix(CLL_drug_dist, K, sigma)
CLL_meth_W <- affinityMatrix(CLL_meth_dist, K, sigma)
CLL_mrna_W <- affinityMatrix(CLL_mrna_dist, K, sigma)The following figures are the heatmap of the similarity matrix (W) of each data type. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities.
pheatmap(log(CLL_drug_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL Drugs")
pheatmap(log(CLL_meth_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL Methylation")
pheatmap(log(CLL_mrna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL mRNA")Figure 12.1: CLL dataset - log-transformed similarity matrix heatmap
The red color means high similarity between two samples. The blue color means small similarity between two samples.
Heatmaps are different between data types. Each data seems to carry different information about samples. - drug data: the heatmap shows two groups of similar samples, but non of them seems to be related to a specific metadata. - methylation data: the heatmap shows two or maybe three groups of similar samples. Groups seem to be related to the IGHV status. - mRNA data: the heatmap shows small groups but not clear one.
We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix.
To create the fused similarity matrix, we set three parameters:
K = 20: number of nearest neighborsT = 10: number of iterations## H045 H109 H056 H079 H164
## H045 0.5000000000 0.0157070950 0.0008608448 0.0186693619 0.010246238
## H109 0.0157070950 0.5000000000 0.0009468912 0.0085345260 0.004235980
## H056 0.0008608448 0.0009468912 0.5000000000 0.0008622857 0.000796166
## H079 0.0186693619 0.0085345260 0.0008622857 0.5000000000 0.005578270
## H164 0.0102462384 0.0042359797 0.0007961660 0.0055782704 0.500000000
The dimensions of the fused network are 121 rows and 121 columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them weights.
## [1] 121 121
The fused similarity matrix contains 14641 weights.
## [1] 14641
The fused similarity matrix doesn’t contain zero.
## [1] 0
The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization.
pheatmap(log(CLL_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = CLL_metadata[c(1, 2, 6, 7, 8)], main = "CLL - Fused similarity matrix W")The red color means high similarity between two samples. The blue color means small similarity between two samples.
The heatmap shows two main groups. Samples between groups are very different. Groups seem to be related to the IGHV status. This heatmap doesn’t give us obvious information. We can make several assumptions:
It could be interesting to try another distance and/or try with different parameters.
Now, we create a fused similarity network from the fused similarity matrix. Self loops are remove (diag = FALSE) and only the upper values of the matrix are taken (mode = "upper", avoid duplicate information).
Then, the fused similarity network is saved into a the CLL_W_edgeList.txt file:
write.table(as_data_frame(CLL_W_net), "../02_Results/02_CLL/CLL_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")This files is loaded into Cytoscape. The Figure ?? shows the fused similarity network of the Tara Ocean dataset.
According Cytoscape, the fused similarity network contains 121 nodes (samples) and 7260 edges (connections) between samples. The connections number is smaller in Cytoscape. Indeed, in the similarity matrix weights are duplicates. The similarity matrix contains also the weights for each sample compare to itself (self loops).
For now, the fused similarity network is fully connected: each sample is connected to every other samples. Connections between samples are weighted: some connections are strong (samples are similar) and some other are weak (samples are not similar).
In this section, we will determine a threshold to select the strongest connections between samples.
We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
CLL_weights <- edge.attributes(CLL_W_net)$weight
hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(0.004466836, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: 0.004466836. With this threshold, we select 2457 connections.
We calculate the median of the weights.
## [1] 0.002700864
With the mean (0.0027009) as threshold, we select 3630 connections.
## [1] 3630
Calculate the third quantile of the weights:
## 75%
## 0.005983555
With the third quantile (0.0059836) as threshold, we select 1815 connections.
## [1] 1815
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(CLL_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "log10(weights)")
abline(v = log(CLL_W_median, 10), col = "blue", lwd = 3)
text(log(CLL_W_median, 10), 160, pos = 4, "Median", col = "blue", cex = 1)
abline(v = log(CLL_W_q75, 10), col = "purple", lwd = 3)
text(log(CLL_W_q75, 10), 160, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0004991 0.0014801 0.0027246 0.0082645 0.0061004 0.5000000
We define a vector of threshold range to try (at least 100 values).
## [1] 101
Then, we calculate the Average Clustering Coefficient for each threshold.
CLL_ACC_W <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, CLL_W_net))Calculated values are displayed in the following figures.
## ACC
plot(x = CLL_ACC_W$thresholds, y = CLL_ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = CLL_ACC_W$thresholds[1], y = CLL_ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = CLL_ACC_W$thresholds[41], y = CLL_ACC_W$ACC[41], col = "pink", pch = 16, cex = 1.2)
points(x = CLL_ACC_W$thresholds[42], y = CLL_ACC_W$ACC[42], col = "purple", pch = 16, cex = 1.2)
text(CLL_ACC_W$thresholds[42], 0.5, pos = 4, paste0("Threshold = ", CLL_ACC_W$thresholds[42]), col = "purple")
text(CLL_ACC_W$thresholds[42], 0.4, pos = 4, paste0("ACCmax = ", CLL_ACC_W$ACC[42]), col = "purple")
## EN
plot(x = CLL_ACC_W$thresholds, y = CLL_ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = CLL_ACC_W$thresholds[42], col = "purple")
text(CLL_ACC_W$thresholds[42], 6000, pos = 4, paste0("Threshold = ", CLL_ACC_W$thresholds[42]), col = "purple")
text(CLL_ACC_W$thresholds[42], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_W$ACC[42]), col = "purple")
text(CLL_ACC_W$thresholds[42], 4500, pos = 4, paste0("EN = ", CLL_ACC_W[42, "EN"]), col = "purple")Figure 12.2: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
If we selected the purple local maxima, we will have 312 edges. It could be not enough edges. Let’s see during the visualization.
## [1] 0.0123
The network visualization on the left was created with the third quantile (0.0059836). The network visualization on the right was created with the ACC method (0.0123).
The left network contains lot of edges and it’s difficult to see clear connection between samples. Nevertheless, we can see two groups of samples, according the IGVH status.
This trend is also shows in the right network. Samples with same IGVH status are connected together.
It could be interesting to map other metadata on the network. Network visualizations are available in the CLL_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
CLL_drug_net <- graph_from_adjacency_matrix(CLL_drug_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(CLL_drug_net), "../02_Results/02_CLL/CLL_drug_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
CLL_weights <- edge.attributes(CLL_drug_net)$weight
hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(0.00007, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a binomial distribution. We would probably like to cut between the two peaks and choose the corresponding weight: 0.00007. With this threshold, we select 3210 connections.
We calculate the median of the weights.
## [1] 5.703429e-05
Number of selected edges with the median as threshold.
## [1] 3630
Calculate the third quantile of the weights:
## 75%
## 0.0001371145
Number of selected edges with the third quantile as threshold:
## [1] 1815
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(CLL_drug_median, 10), col = "blue", lwd = 3)
text(log(CLL_drug_median, 10) - 0.5, 200, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(CLL_drug_q75, 10), col = "purple", lwd = 3)
text(log(CLL_drug_q75, 10), 230, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.650e-07 2.170e-05 5.803e-05 1.347e-04 1.394e-04 5.758e-03
We define a vector of threshold range to try (at least 100 values).
## [1] 151
Then, we calculate the Average Clustering Coefficient for each threshold.
CLL_ACC_drug <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, CLL_drug_net))Calculated values are displayed in the following figures.
## ACC
plot(x = CLL_ACC_drug$thresholds, y = CLL_ACC_drug$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = CLL_ACC_drug$thresholds[1], y = CLL_ACC_drug$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = CLL_ACC_drug$thresholds[39], y = CLL_ACC_drug$ACC[39], col = "pink", pch = 16, cex = 1.2)
points(x = CLL_ACC_drug$thresholds[45], y = CLL_ACC_drug$ACC[45], col = "purple", pch = 16, cex = 1.2)
text(CLL_ACC_drug$thresholds[45], 0.5, pos = 4, paste0("Threshold = ", CLL_ACC_drug$thresholds[45]), col = "purple")
text(CLL_ACC_drug$thresholds[45], 0.4, pos = 4, paste0("ACCmax = ", CLL_ACC_drug$ACC[45]), col = "purple")
## EN
plot(x = CLL_ACC_drug$thresholds, y = CLL_ACC_drug$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = CLL_ACC_drug$thresholds[42], col = "purple")
text(CLL_ACC_drug$thresholds[45], 5300, pos = 4, paste0("Threshold = ", CLL_ACC_drug$thresholds[45]), col = "purple")
text(CLL_ACC_drug$thresholds[45], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_drug$ACC[45]), col = "purple")
text(CLL_ACC_drug$thresholds[45], 4700, pos = 4, paste0("EN = ", CLL_ACC_drug[45, "EN"]), col = "purple")Figure 12.3: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
If we selected the purple local maxima, we will have 252 edges. It could be not enough edges. Let’s see during the visualization.
## [1] 0.00044
Network visualizations are available in the CLL_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
CLL_meth_net <- graph_from_adjacency_matrix(CLL_meth_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(CLL_meth_net), "../02_Results/02_CLL/CLL_meth_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
CLL_weights <- edge.attributes(CLL_meth_net)$weight
hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(1.258925e-05, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a kind of normal distribution. We probably would like to cut the peaks and choose the corresponding weight: 1.258925e-05. With this threshold, we select 1393 connections.
We calculate the median of the weights.
## [1] 7.647628e-06
Number of selected edges with the median as threshold.
## [1] 3630
Calculate the third quantile of the weights:
## 75%
## 1.159533e-05
Number of selected edges with the third quantile as threshold.
## [1] 1815
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(CLL_meth_median, 10), col = "blue", lwd = 3)
text(log(CLL_meth_median, 10), 200, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(CLL_meth_q75, 10), col = "purple", lwd = 3)
text(log(CLL_meth_q75, 10), 230, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.822e-07 4.287e-06 7.720e-06 9.893e-06 1.169e-05 2.782e-04
We define a vector of threshold range to try (at least 100 values).
## [1] 121
Then, we calculate the Average Clustering Coefficient for each threshold.
CLL_ACC_meth <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, CLL_meth_net))Calculated values are displayed in the following figures.
## ACC
plot(x = CLL_ACC_meth$thresholds, y = CLL_ACC_meth$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = CLL_ACC_meth$thresholds[1], y = CLL_ACC_meth$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = CLL_ACC_meth$thresholds[41], y = CLL_ACC_meth$ACC[41], col = "pink", pch = 16, cex = 1.2)
points(x = CLL_ACC_meth$thresholds[42], y = CLL_ACC_meth$ACC[42], col = "purple", pch = 16, cex = 1.2)
text(CLL_ACC_meth$thresholds[42], 0.55, pos = 4, paste0("Threshold = ", CLL_ACC_meth$thresholds[42]), col = "purple")
text(CLL_ACC_meth$thresholds[42], 0.45, pos = 4, paste0("ACCmax = ", CLL_ACC_meth$ACC[42]), col = "purple")
## EN
plot(x = CLL_ACC_meth$thresholds, y = CLL_ACC_meth$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = CLL_ACC_meth$thresholds[42], col = "purple")
text(CLL_ACC_meth$thresholds[42], 5500, pos = 4, paste0("Threshold = ", CLL_ACC_meth$thresholds[42]), col = "purple")
text(CLL_ACC_meth$thresholds[42], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_meth$ACC[42]), col = "purple")
text(CLL_ACC_meth$thresholds[42], 4500, pos = 4, paste0("EN = ", CLL_ACC_meth[42, "EN"]), col = "purple")Figure 12.4: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
If we selected the purple local maxima, we will have 164 edges. It could be not enough edges. Let’s see during the visualization.
## [1] 2.05e-05
Network visualizations are available in the CLL_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
CLL_mrna_net <- graph_from_adjacency_matrix(CLL_mrna_W, weighted = TRUE, mode = "upper", diag = FALSE)
write.table(as_data_frame(CLL_mrna_net), "../02_Results/02_CLL/CLL_mrna_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
CLL_weights <- edge.attributes(CLL_mrna_net)$weight
hist(CLL_weights, nclass = 100, main = "Fused network weight distribution", xlab = "weights")
hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "weights")
abline(v = log(5.011872e-06, 10), col = "cyan", lwd = 3)The log-transformed weight distribution shows a kind of binomial distribution. We probably would like to cut between the two peaks and choose the corresponding weight: 5.011872e-06. With this threshold, we select 4013 connections.
We calculate the median of the weights.
## [1] 5.603111e-06
Number of selected edges with the median as threshold.
## [1] 3630
Calculate the third quantile of the weights:
## 75%
## 8.912064e-06
Number of selected edges with the third quantile as threshold.
## [1] 1815
The following figure displays the log-transformed weight distribution with the two previous calculated metrics as markers.
hist(log(CLL_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(CLL_mrna_median, 10), col = "blue", lwd = 3)
text(log(CLL_mrna_median, 10), 200, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(CLL_mrna_q75, 10), col = "purple", lwd = 3)
text(log(CLL_mrna_q75, 10), 190, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine a range of thresholds to try, we check the weights.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.123e-07 3.351e-06 5.647e-06 8.180e-06 9.049e-06 2.305e-04
We define a vector of threshold range to try (at least 100 values).
## [1] 101
Then, we calculate the Average Clustering Coefficient for each threshold.
CLL_ACC_mrna <- do.call(rbind, lapply(thresholds, function(t, graph){
graph_sub <- subgraph.edges(graph, E(graph)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(graph_sub), "thresholds" = t, "EN" = length(E(graph_sub)))
return(df)
}, CLL_mrna_net))Calculated values are displayed in the following figures.
## ACC
plot(x = CLL_ACC_mrna$thresholds, y = CLL_ACC_mrna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = CLL_ACC_mrna$thresholds[1], y = CLL_ACC_mrna$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = CLL_ACC_mrna$thresholds[46], y = CLL_ACC_mrna$ACC[46], col = "pink", pch = 16, cex = 1.2)
points(x = CLL_ACC_mrna$thresholds[48], y = CLL_ACC_mrna$ACC[48], col = "purple", pch = 16, cex = 1.2)
text(CLL_ACC_mrna$thresholds[48], 0.55, pos = 4, paste0("Threshold = ", CLL_ACC_mrna$thresholds[48]), col = "purple")
text(CLL_ACC_mrna$thresholds[48], 0.45, pos = 4, paste0("ACCmax = ", CLL_ACC_mrna$ACC[48]), col = "purple")
## EN
plot(x = CLL_ACC_mrna$thresholds, y = CLL_ACC_mrna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = CLL_ACC_mrna$thresholds[48], col = "purple")
text(CLL_ACC_mrna$thresholds[48], 5500, pos = 4, paste0("Threshold = ", CLL_ACC_mrna$thresholds[48]), col = "purple")
text(CLL_ACC_mrna$thresholds[48], 5000, pos = 4, paste0("ACCmax = ", CLL_ACC_mrna$ACC[48]), col = "purple")
text(CLL_ACC_mrna$thresholds[48], 4500, pos = 4, paste0("EN = ", CLL_ACC_mrna[48, "EN"]), col = "purple")Figure 12.5: Left: Average Clustering Coeeficient (ACC) values for each threshold - Right: Number of edges on network for each threshold
If we selected the purple local maxima, we will have 229 edges. It could be not enough edges. Let’s see during the visualization.
## [1] 1.88e-05
Network visualizations are available in the CLL_cytoscape.cys file.
We decided to perform a clustering analysis with two and three clusters.
Then, results are save into the same file.
These are two examples of network visualization for the CLL dataset.
IGHV status is driving the clustering (right network). Inside this two groups, we can see a subnetwork that contains sample with trisomy12.
With this network, we can predicted the possible IGVH status of samples without this information.
The omic tomato plant dataset contains two omics data types:
Data files are available in the summer school’s GitHub repository.
First, we load the transcript data that are in the mrna.tsv file. This file contains header (head = TRUE) and the first column contains row names (row.names = 1). Below, the first columns and rows are displayed.
tomato_mrna <- read.table("../00_Data/Tomato/mrna.tsv", head = TRUE, row.names = 1)
tomato_mrna[c(1:5), c(1:5)]## s_1_1 s_1_2 s_1_3 s_2_1 s_2_2
## Solyc00g005050.2.1 -0.542 -0.431 -0.519 -0.128 -0.055
## Solyc00g006800.2.1 -0.243 -0.234 -0.165 0.060 0.177
## Solyc00g007270.2.1 -0.641 -0.738 -0.761 -0.331 -0.128
## Solyc00g009020.2.1 0.346 0.607 0.155 -0.131 -0.081
## Solyc00g011890.2.1 -0.318 -0.419 -0.510 -0.484 -0.479
Transcript data dimensions are nrow(tomato_mrna) rows and ncol(tomato_mrna):
## [1] 2375 27
Samples (2375) are in columns and transcripts (2375) are in rows. We need to transpose this matrix.
The transposed matrix dimensions are nrow(tomato_mrna_t) rows and ncol(tomato_mrna_t). Samples are in columns and transcripts are in rows.
## [1] 27 2375
Below, the first five rows and columns are displayed:
## Solyc00g005050.2.1 Solyc00g006800.2.1 Solyc00g007270.2.1
## s_1_1 -0.542 -0.243 -0.641
## s_1_2 -0.431 -0.234 -0.738
## s_1_3 -0.519 -0.165 -0.761
## s_2_1 -0.128 0.060 -0.331
## s_2_2 -0.055 0.177 -0.128
## Solyc00g009020.2.1 Solyc00g011890.2.1
## s_1_1 0.346 -0.318
## s_1_2 0.607 -0.419
## s_1_3 0.155 -0.510
## s_2_1 -0.131 -0.484
## s_2_2 -0.081 -0.479
Then, we load the protein data, available in the file prots.tsv. The file contains column heads (head = TRUE) and the first column contains row names (row.names = 1). Below, the first five columns and rows are displayed.
tomato_prot <- read.table("../00_Data/Tomato/prots.tsv", head = TRUE, row.names = 1)
tomato_prot[c(1:5), c(1:5)]## s_1_1 s_1_2 s_1_3 s_2_1 s_2_2
## Solyc00g005050.2.1 -0.025 0.695 0.270 -0.031 0.146
## Solyc00g006800.2.1 -0.391 -0.457 -0.003 0.217 -0.004
## Solyc00g007270.2.1 -1.469 -0.775 -1.311 -0.185 -0.061
## Solyc00g009020.2.1 0.562 0.493 0.895 0.525 0.471
## Solyc00g011890.2.1 0.063 -0.078 0.119 -0.346 -0.339
Protein data have 27 columns:
## [1] 27
Protein data have 2375 rows:
## [1] 2375
Rows are proteins and columns are samples. To continue the analysis, data need to have the samples in rows and features in columns. So, we transpose the protein data.
Now, protein data are in the right shape, as you can see below:
## Solyc00g005050.2.1 Solyc00g006800.2.1 Solyc00g007270.2.1
## s_1_1 -0.025 -0.391 -1.469
## s_1_2 0.695 -0.457 -0.775
## s_1_3 0.270 -0.003 -1.311
## s_2_1 -0.031 0.217 -0.185
## s_2_2 0.146 -0.004 -0.061
## Solyc00g009020.2.1 Solyc00g011890.2.1
## s_1_1 0.562 0.063
## s_1_2 0.493 -0.078
## s_1_3 0.895 0.119
## s_2_1 0.525 -0.346
## s_2_2 0.471 -0.339
Finally, we load the metadata that contain information about samples. Metadata are stored in the samples_metadata.csv file. This file is semicolon-separated (sep = ";") and contains column heads (head = TRUE). The first column is also row names (row.names = 1).
tomato_metadata <- read.table("../00_Data/Tomato/samples_metadata.csv", head = TRUE, row.names = 1, sep = ";")The metadata file contains information about:
## [1] "dpa" "growth_stage"
There are 20 different dpa:
## [1] 7 8 15 21 22 27 28 29 34 35 40 42 49 48 NA 50 51 54 53 52
There are three replicates per growth stage:
##
## GR1 GR2 GR3 GR4 GR5 GR6 GR7 GR8 GR9
## 3 3 3 3 3 3 3 3 3
Transcript data don’t contain missing value:
##
## FALSE
## 64125
Protein data don’t contain neither missing value:
##
## FALSE
## 64125
We can go to the following steps.
We assume that data habe been already prepared and normalized.
Transcript data are scaled: each column will scaled to have the mean equals to zero and the standard deviation equals to one.
Below, we show the data distribution before and after scaling. We expected to have a normal distribution of the data after scaling.
hist(tomato_mrna_t, nclass = 100, main = "Tomato fruit - Transcript data - Prepared data", xlab = "values")
hist(tara_phy_scaled, nclass = 100, main = "Tomato fruit - Transcript data - Scaled data", xlab = "values")Data seem to be already scaled. So we will use the transposed data tomato_mrna_t for the following analysis.
Protein data are scaled.
Below, we show the data distribution before and after scaling. We expected to have a normal distribution of the data after scaling.
hist(tomato_prot_t, nclass = 100, main = "Tomato fruit - Protein data - Prepared data", xlab = "values")
hist(tomato_prot_scaled, nclass = 100, main = "Tomato fruit - Protein data - Scaled data", xlab = "values")These data seem also already scaled. So we will use the transposed data tomato_prot_t for the following analysis.
In this part, we create the similarity network for each data type.
First, we calculate the Euclidean distance between each sample for each data type.
tomato_mrna_dist <- dist2(tomato_mrna_t, tomato_mrna_t)
tomato_prot_dist <- dist2(tomato_prot_t, tomato_prot_t)The created distance matrix dimensions are 27 rows and 27 columns. We calculated pairwise distance, so the matrix has samples in rows and in columns.
## [1] 27 27
The diagonal of the distance matrix contains the distance between sample and itself. There is distance between a sample and itself, so the distance is equal (or very close) to zero.
## s_1_1 s_1_2 s_1_3 s_2_1 s_2_2
## s_1_1 9.094947e-13 1.326255e+02 8.693254e+01 2.401490e+02 3.900077e+02
## s_1_2 1.326255e+02 6.821210e-13 2.801969e+02 3.532615e+02 4.989396e+02
## s_1_3 8.693254e+01 2.801969e+02 2.046363e-12 2.096528e+02 3.614704e+02
## s_2_1 2.401490e+02 3.532615e+02 2.096528e+02 1.023182e-12 4.759286e+01
## s_2_2 3.900077e+02 4.989396e+02 3.614704e+02 4.759286e+01 1.136868e-13
High distance values mean that samples are not similar. And small distance values mean that samples are similar.
The distance matrix is then transformed into similarity matrix for each data type. We set two parameters:
K = 20: number of nearest neighborssignma = 0.5: hyperparameterThe affinityMatrix() function transforms the distance into similarity according the distance with the nearest neighbors.
tomato_mrna_W <- affinityMatrix(tomato_mrna_dist, K = K, sigma = sigma)
tomato_prot_W <- affinityMatrix(tomato_prot_dist, K = K, sigma = sigma)The following figures are the heatmap of the similarity matrix (W) of each data type. The left heatmap are the mrna data and the right heatmap are the protein data. Samples are clustered using hierarchical clustering. For a better visualization, we log-transform similarities.
pheatmap(log(tomato_mrna_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tomato_metadata, main = "Transcript data - log10-transformed similarity values")
pheatmap(log(tomato_prot_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tomato_metadata, main = "Protein data - log10-transformed similarity values")Red color means a high similarity value between two samples whereas blue color means a small similarity value between two samples.
Heatmaps are different between the transcript and the protein data. Each data type carries different kind of sample information. The clustering shows that one group is clearly retrieve in both data type (samples in last dpa). Protein data seem to define better the development cycle of the tomato fruit.
We created a similarity matrix for each data type. We saw that each network carries common information and its own information. Now, we will integrate all this information into only one fused similarity matrix.
We create the fused similarity matrix using these three parameters:
K = 20: number of nearest neighborst = 10: number of iterationsThe dimension of the fused network are 27 rows and 27 columns, such as the previous similarity matrices. The fused similarity matrix contains similarities between samples, we can also called them weights.
The fused similarity network contains 729 weights.
## [1] 729
The fused similarity network doesn’t contain zero:
##
## FALSE
## 729
The following figure is the heatmap of the fused similarity matrix. Samples are automatically clustered with a hierarchical clustering. Weights are log-transformed for a better visualization.
pheatmap(log(tomato_W, 10), show_rownames = FALSE, show_colnames = FALSE, annotation = tomato_metadata, main = "Fused similarity matrix - log10-transformed similarity values")Read color means a high similarity between samples. Blue color means a small similarity between samples.
This heatmap seems to be a perfect mix between the two previous individual heatmaps. We still see two main groups: one with the last dpa and one other big with the other development stages. But, in this big group, now stages seem to be well grouped.
We create a fused similarity network from the fused similarity matrix. Self loops are remove (diag = FALSE) and only the upper values of the matrix are taken (mode = "upper", avoid duplicate information).
tomato_W_net <- graph_from_adjacency_matrix(tomato_W, diag = FALSE, mode = "upper", weighted = TRUE)Then, the fused similarity network is saved into a the Tomato_W_edgeList.txt file:
write.table(as_data_frame(tomato_W_net), "../02_Results/04_Tomato/Tomato_W_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")Figure 13.1: First Cytoscape visualization of the fused similarity network of Tomato fruit dataset.
According Cytoscape, the fused network contains 27 samples (nodes) and 351 connections (edges) between samples. The edge number is smaller in Cytoscape because we removed the self loop and took only half of the similarity matrix.
For now, the network is fully connected: each sample is connected to every sample. Connections between samples are weights: some connections are strong (samples are similar) some other are weak (samples are not similar).
So in this section, we will choose a threshold to keep the strongest connections.
We extract the weight to display the corresponding distribution to try to find a threshold. The distribution in the left is created using the raw weights. The distribution in the right is created using the log-transformed weights.
tomato_weights <- edge.attributes(tomato_W_net)$weight
hist(tomato_weights, nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")
hist(log(tomato_weights, 10), nclass = 100, main = "Fused similarity network weight distribution", xlab = "weights")It’s not obvious how to choose the threshold with the weight distribution. Let’s see other methods.
Calculate the median:
## [1] 0.01292061
Number of selected edges with the median as threshold:
## [1] 176
Calculate the third quantile:
## 75%
## 0.0324337
Number of selected edges with the third quantile as threshold:
## [1] 88
The following figures show where are these two threshold in the weight distribution.
hist(log(tomato_weights, 10), nclass = 100, main = "Fused network weight distribution", xlab = "log10(weights)")
abline(v = log(tomato_W_median, 10), col = "blue", lwd = 3)
text(log(tomato_W_median, 10), 10, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tomato_W_q75, 10), col = "purple", lwd = 3)
text(log(tomato_W_q75, 10), 12, pos = 4, "quantile 75%", col = "purple", cex = 1)To determine the range of the threshold, we check the weights:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001066 0.003292 0.012921 0.019231 0.032434 0.095629
We define the threshold range to try:
## [1] 192
Then, we calculate the Average Clustering Coefficient for each threshold.
tomato_ACC_W <- do.call(rbind, lapply(thresholds, function(t, net){
net_sub <- subgraph.edges(net, E(net)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub)))
return(df)
}, tomato_W_net))Calculated values are displayed in the following figures:
plot(x = tomato_ACC_W$thresholds, y = tomato_ACC_W$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Fused network W", type = "o")
points(x = tomato_ACC_W$thresholds[1], y = tomato_ACC_W$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tomato_ACC_W$thresholds[73], y = tomato_ACC_W$ACC[73], col = "pink", pch = 16, cex = 1.2)
points(x = tomato_ACC_W$thresholds[80], y = tomato_ACC_W$ACC[80], col = "purple", pch = 16, cex = 1.2)
points(x = tomato_ACC_W$thresholds[71], y = tomato_ACC_W$ACC[71], col = "cyan", pch = 16, cex = 1.2)
abline(v = tomato_ACC_W$thresholds[80], col = "purple")
text(tomato_ACC_W$thresholds[80], 0.5, pos = 4, paste0("Threshold = ", tomato_ACC_W$thresholds[80]), col = "purple")
text(tomato_ACC_W$thresholds[80], 0.4, pos = 4, paste0("ACCmax = ", tomato_ACC_W$ACC[80]), col = "purple")
plot(x = tomato_ACC_W$thresholds, y = tomato_ACC_W$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Fused network W", type = "o")
abline(v = tomato_ACC_W$thresholds[80], col = "purple")
text(tomato_ACC_W$thresholds[80], 350, pos = 4, paste0("Threshold = ", tomato_ACC_W$thresholds[80]), col = "purple")
text(tomato_ACC_W$thresholds[80], 300, pos = 4, paste0("ACCmax = ", tomato_ACC_W$ACC[80]), col = "purple")
text(tomato_ACC_W$thresholds[80], 250, pos = 4, paste0("EN = ", tomato_ACC_W[80, "EN"]), col = "purple")If we selected the purple local maxima, we will have 52 edges. It could be not enough edges.
## [1] 0.0395
We can try with another local maxima.
## [1] 0.035
## [1] 72
It could be interesting to try another threshold more.
## [1] 0.0085
## [1] 203
The following figures are filtered network using 0.035 (left) and 0.013 (right) as thresholds.
We think that the right network doesn’t have enough edges. We loose to much information between samples. We will probably use the network on the left for the following visualization.
Network visualizations are available in the Tomato_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
tomato_mrna_net <- graph_from_adjacency_matrix(tomato_mrna_W, diag = FALSE, mode = "upper", weighted = TRUE)
write.table(as_data_frame(tomato_mrna_net), "../02_Results/04_Tomato/Tomato_mrna_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")It’s not obvious how to choose the threshold with the weight distribution. Let’s see other methods.
tomato_weights <- edge.attributes(tomato_mrna_net)$weight
hist(tomato_weights, nclass = 100, main = "Transcript weight distribution", xlab = "weights")
hist(log(tomato_weights, 10), nclass = 100, main = "Transcript weight distribution", xlab = "weights")Calculate the median:
## [1] 6.307718e-05
Number of selected edges with the median as threshold:
## [1] 176
Calculate the third quantile:
## 75%
## 0.0006490944
Number of selected edges with the third quantile as threshold:
## [1] 88
The following figures show where are these two threshold in the weight distribution.
hist(log(tomato_weights, 10), nclass = 100, main = "Transcript weight distribution", xlab = "log10(weights)")
abline(v = log(tomato_mrna_median, 10), col = "blue", lwd = 3)
text(log(tomato_mrna_median, 10), 10, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tomato_mrna_q75, 10), col = "purple", lwd = 3)
text(log(tomato_mrna_q75, 10), 12, pos = 2, "quantile 75%", col = "purple", cex = 1)To determine the range of the threshold, we check the weights:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.571e-06 8.678e-06 6.308e-05 3.722e-04 6.491e-04 2.522e-03
We define the threshold range to try:
## [1] 127
Then, we calculate the Average Clustering Coefficient for each threshold.
tomato_ACC_mrna <- do.call(rbind, lapply(thresholds, function(t, net){
net_sub <- subgraph.edges(net, E(net)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub)))
return(df)
}, tomato_mrna_net))Calculated values are displayed in the following figures:
plot(x = tomato_ACC_mrna$thresholds, y = tomato_ACC_mrna$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the Transcript W", type = "o")
points(x = tomato_ACC_mrna$thresholds[1], y = tomato_ACC_mrna$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tomato_ACC_mrna$thresholds[36], y = tomato_ACC_mrna$ACC[36], col = "pink", pch = 16, cex = 1.2)
points(x = tomato_ACC_mrna$thresholds[37], y = tomato_ACC_mrna$ACC[37], col = "purple", pch = 16, cex = 1.2)
abline(v = tomato_ACC_mrna$thresholds[37], col = "purple")
text(tomato_ACC_mrna$thresholds[37], 0.5, pos = 2, paste0("Threshold = ", tomato_ACC_mrna$thresholds[37]), col = "purple")
text(tomato_ACC_mrna$thresholds[37], 0.4, pos = 2, paste0("ACCmax = ", tomato_ACC_mrna$ACC[37]), col = "purple")
plot(x = tomato_ACC_mrna$thresholds, y = tomato_ACC_mrna$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the Transcript W", type = "o")
abline(v = tomato_ACC_mrna$thresholds[37], col = "purple")
text(tomato_ACC_mrna$thresholds[37], 350, pos = 4, paste0("Threshold = ", tomato_ACC_mrna$thresholds[37]), col = "purple")
text(tomato_ACC_mrna$thresholds[37], 300, pos = 4, paste0("ACCmax = ", tomato_ACC_mrna$ACC[37]), col = "purple")
text(tomato_ACC_mrna$thresholds[37], 250, pos = 4, paste0("EN = ", tomato_ACC_mrna[37, "EN"]), col = "purple")The local maxima threshold is:
## [1] 0.00072
And the number of selected egdes are:
## [1] 77
Network visualizations are available in the Tomato_cytoscape.cys file.
Similarity matrix is transformed to a similarity network and saved into a file. We still keep only half (mode = "upper") of the similarity matrix (avoid redundancies) and remove the self loop (diag = FALSE).
tomato_prot_net <- graph_from_adjacency_matrix(tomato_prot_W, diag = FALSE, mode = "upper", weighted = TRUE)
write.table(as_data_frame(tomato_prot_net), "../02_Results/04_Tomato/Tomato_prot_edgeList.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")It’s not obvious how to choose the threshold with the weight distribution. Let’s see other methods.
tomato_weights <- edge.attributes(tomato_prot_net)$weight
hist(tomato_weights, nclass = 100, main = "Protein weight distribution", xlab = "weights")
hist(log(tomato_weights, 10), nclass = 100, main = "Protein weight distribution", xlab = "weights")Calculate the median:
## [1] 7.254911e-05
Number of selected edges with the median as threshold:
## [1] 176
Calculate the third quantile:
## 75%
## 0.0004788178
Number of selected edges with the third quantile as threshold:
## [1] 88
The following figures show where are these two threshold in the weight distribution.
hist(log(tomato_weights, 10), nclass = 100, main = "Protein weight distribution", xlab = "log10(weights)")
abline(v = log(tomato_mrna_median, 10), col = "blue", lwd = 3)
text(log(tomato_mrna_median, 10), 12, pos = 2, "Median", col = "blue", cex = 1)
abline(v = log(tomato_mrna_q75, 10), col = "purple", lwd = 3)
text(log(tomato_mrna_q75, 10), 12, pos = 2, "quantile 75%", col = "purple", cex = 1)To determine the range of the threshold, we check the weights:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.412e-06 1.379e-05 7.255e-05 3.042e-04 4.788e-04 2.169e-03
We define the threshold range to try:
## [1] 101
Then, we calculate the Average Clustering Coefficient for each threshold.
tomato_ACC_prot <- do.call(rbind, lapply(thresholds, function(t, net){
net_sub <- subgraph.edges(net, E(net)[weight >= t])
df <- data.frame("ACC" = ACCCalculation(net_sub), "thresholds" = t, "EN" = length(E(net_sub)))
return(df)
}, tomato_prot_net))Calculated values are displayed in the following figures:
plot(x = tomato_ACC_prot$thresholds, y = tomato_ACC_prot$ACC, xlab = "thresholds", ylab = "ACC", main = "ACC calculation of the protein W", type = "o")
points(x = tomato_ACC_prot$thresholds[1], y = tomato_ACC_prot$ACC[1], col = "red", pch = 16, cex = 1.2)
points(x = tomato_ACC_prot$thresholds[7], y = tomato_ACC_prot$ACC[7], col = "pink", pch = 16, cex = 1.2)
points(x = tomato_ACC_prot$thresholds[9], y = tomato_ACC_prot$ACC[9], col = "purple", pch = 16, cex = 1.2)
abline(v = tomato_ACC_prot$thresholds[9], col = "purple")
text(tomato_ACC_prot$thresholds[9], 0.8, pos = 4, paste0("Threshold = ", tomato_ACC_prot$thresholds[9]), col = "purple")
text(tomato_ACC_prot$thresholds[9], 0.7, pos = 4, paste0("ACCmax = ", tomato_ACC_prot$ACC[9]), col = "purple")
plot(x = tomato_ACC_prot$thresholds, y = tomato_ACC_prot$EN, xlab = "thresholds", ylab = "number of edges", main = "EN of the protein W", type = "o")
abline(v = tomato_ACC_prot$thresholds[9], col = "purple")
text(tomato_ACC_prot$thresholds[9], 350, pos = 4, paste0("Threshold = ", tomato_ACC_prot$thresholds[9]), col = "purple")
text(tomato_ACC_prot$thresholds[9], 300, pos = 4, paste0("ACCmax = ", tomato_ACC_prot$ACC[9]), col = "purple")
text(tomato_ACC_prot$thresholds[9], 250, pos = 4, paste0("EN = ", tomato_ACC_prot[9, "EN"]), col = "purple")The local maxima threshold is:
## [1] 8e-05
And the number of selected egdes are:
## [1] 170
Network visualizations are available in the Tomato_cytoscape.cys file.
In the Belouah et al. paper, they define three development stages. According that, we will run a clustering with three clusters.
C <- 3
group <- data.frame(Groups = spectralClustering(tomato_W, C))
row.names(group) <- colnames(tomato_W)
dataGroups <- merge(tomato_metadata, group, by = 0)
head(dataGroups)## Row.names dpa growth_stage Groups
## 1 s_1_1 7 GR1 3
## 2 s_1_2 8 GR1 3
## 3 s_1_3 8 GR1 3
## 4 s_2_1 15 GR2 3
## 5 s_2_2 15 GR2 3
## 6 s_2_3 15 GR2 3
Then, we save the results into a result file:
These are two examples of network visualization for the tomato fruit dataset.
Figure 13.2: Left network: edge weights > 0.013. Right network: edge weights > 0.032.
three clusters.Overall, we see two groups of nodes: one with the late growth stages (GR7, GR8 and GR9) and one other with the early growth stages (GR1-GR6). This two groups seem to be very different because there are few connections between them. We already saw these two groups with the individual heatmaps.
In the Belouah et al. paper, the three last growth state correspond to the ripening stage (appearance of fruit color). With the clustering, we also detect the three stage level that they describe: early, mid and late stages of fruit development.
With the network visualization, we can see a kind of kinetic from the early stage (GR1 only connected to GR2) to the mid stage. We can also see that the protein data type carries the most information in this network, inside the groups.